1  Introduction


The  organization  of  communication  among  chips  is  a  major  concern  in  the  design  of  an 
electronic  system.  Because  of  the  costs  associated  with  wiring  and  packaging,  it  is  generally 
desirable to minimize the number of wires and the number of pins per chip in an architecture.
This paper investigates how busses (multiple-pin wires) can be employed to efficiently
implement  various  communication  patterns  among  a  set  of  chips.  Other  theoretical  studies 
of  bussed  interconnections  can  be  found  in  [1,  3,  4,  5,  7,  12,  21,  24,  25,  29]. 

Perhaps the simplest example of the advantage of bussed interconnections is the use of
a single shared bus to communicate between any pair of chips connected to the bus in one
clock tick. Communicating between any pair of chips in one clock tick can be implemented
with two-pin wires, but any such scheme requires $\binom{n}{2}$ wires and $n - 1$ pins per chip.* Of
course,  a  two-pin  interconnection  scheme  may  be  able  to  implement  more  communication 
patterns,  but  if  we  are  only  interested  in  communication  between  individual  pairs,  the 
additional  power,  which  comes  at  a  high  cost,  is  wasted. 

An  example  that  better  illustrates  the  ideas  in  this  paper  comes  from  the  problem  of 
building  a  fast  cyclic  shifter  (sometimes  called  a  barrel  shifter)  on  n  chips.  Initially,  each 
chip $c$ contains a one-bit value $x_c$. The function of the shifter is to move each bit $x_c$ to chip
$c + s \pmod{n}$ in one clock tick, where $s$ can be any value between $0$ and $n - 1$.

Any cyclic shifter that uses only two-pin wires requires at least $\binom{n}{2}$ wires and $n - 1$ pins
per  chip  in  order  to  shift  in  one  clock  tick  because  each  chip  must  be  able  to  communicate 
directly  with  each  of  the  other  n  —  1  chips.  Using  busses,  however,  we  can  do  much  better. 
Figure  1  gives  an  architecture  for  a  cyclic  shifter  on  13  chips  which  uses  13  busses  and  only 
4  pins  per  chip.  To  realize  a  shift  by  8,  for  example,  each  chip  writes  its  bit  to  pin  3  and 
reads  from  pin  1.  The  reader  may  verify  that  all  other  cyclic  shifts  among  the  chips  are 
possible  in  one  clock  tick.  (In  Section  4,  we  give  a  general  method  for  constructing  such 
cyclic shifters based on finite projective planes.)


Figure  1:  A  cyclic  shifter  on  13  chips  that  uses  13  busses.  Each  chip  has  4  pins,  and  each  bus 
has  4  chips  connected  to  it.  This  cyclic  shifter  is  based  on  the  difference  cover  {0, 1,3,9}  for  Z13. 

The cyclic shifter of Figure 1 has the advantage of uniformity. All chips have exactly
the same number of pins, and to accomplish each of the 13 permutations specified by the
problem, all chips write to (and read from) pins with identical labels. For all busses, the
number of pins per bus is 4, which is the same as the number of pins per chip. Moreover,
the connections between chips and busses follow a periodic pattern. The uniformity of the
architecture leads to simplicity in the control of the system. Four control wires from a
central controller are sufficient to determine each of the 13 shifts (two wires for specifying
the number of the pin on which to write, and two for the pin to read), which is the
minimum possible. Thus, our control scheme uses the minimum number of control pins,
and the on-chip decoding logic is straightforward and identical for all the chips.

* Unless otherwise specified, we count only data pins in our analysis and omit consideration of the pins
for control, clock, power, and ground since they are needed by all implementations.

Cyclic shifters for general $n$ can be constructed using an idea from combinatorial
mathematics related to difference sets [18, p. 121]. (See also [6, 14, 16, 22, 26].)

Definition 1 A subset $D \subseteq Z_n$ of the integers modulo $n$ is a difference cover for $Z_n$ if for
all $s \in Z_n$, there exist $d_i, d_j \in D$ such that $s = d_i - d_j \pmod{n}$.

That is, every integer in $Z_n$ can be represented as the difference modulo $n$ of two integers
in $D$. For example, the set $D = \{0, 1, 3, 9\}$ is a difference cover for $Z_{13}$, since

 0 = 0 - 0
 1 = 1 - 0
 2 = 3 - 1
 3 = 3 - 0
 4 = 0 - 9
 5 = 1 - 9
 6 = 9 - 3
 7 = 3 - 9
 8 = 9 - 1
 9 = 9 - 0
10 = 0 - 3
11 = 1 - 3
12 = 0 - 1,

where  all  subtractions  are  performed  modulo  13. 
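
The table above can also be checked mechanically. The following is a minimal sketch, not part of the original presentation; the function names `difference_representations` and `is_difference_cover` are hypothetical and chosen only for illustration.

```python
from itertools import product

def difference_representations(D, n):
    """Map each residue s to one pair (d_i, d_j) from D with s = d_i - d_j (mod n)."""
    reps = {}
    for d_i, d_j in product(D, repeat=2):
        reps.setdefault((d_i - d_j) % n, (d_i, d_j))
    return reps

def is_difference_cover(D, n):
    """True if every residue modulo n is a difference of two elements of D."""
    return len(difference_representations(D, n)) == n

print(is_difference_cover([0, 1, 3, 9], 13))   # all residues 0..12 are covered
print(is_difference_cover([0, 1, 3], 13))      # only 7 distinct differences, so not a cover
```

Dropping the element 9 leaves only the residues 0, 1, 2, 3, 10, 11, and 12 representable, which is why the second call reports that the smaller set is not a difference cover for $Z_{13}$.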

Given a difference cover for $Z_n$ with $k$ elements, a cyclic shifter on $n$ chips with $n$ busses
and $k$ pins per chip can be constructed. Suppose $D = \{d_0, d_1, \ldots, d_{k-1}\}$ is a difference
cover for $Z_n$. In the cyclic shifter, chip $c$ connects via its pin $i$ to bus $c + d_i \pmod{n}$, for
all $c = 0, 1, \ldots, n - 1$ and $i = 0, 1, \ldots, k - 1$. To see that any cyclic shift on the $n$ chips
can be uniformly realized, consider a cyclic shift by $s$. Since $D$ is a difference cover for $Z_n$,
there exist $d_i, d_j \in D$ such that $s = d_i - d_j \pmod{n}$. To realize the shift by $s$, each chip
writes to pin $i$ and reads from pin $j$. Chip $c$ therefore writes onto bus $c + d_i$, and bus $c + d_i$
is read by chip $(c + d_i) - d_j = c + s$. No collisions occur because each bus has exactly one
pin labeled $i$ and one pin labeled $j$ connected to it, as can be verified.
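
The construction can be simulated directly. The sketch below is illustrative only: it assumes pins are labeled 0 through $k - 1$ in the order the elements of $D$ are listed, and the helper names (`build_shifter`, `pins_for_shift`, `shift`) are hypothetical.

```python
def build_shifter(D, n):
    """bus_of[c][i] is the bus reached from pin i of chip c: c + d_i (mod n)."""
    return [[(c + d) % n for d in D] for c in range(n)]

def pins_for_shift(D, n, s):
    """Find pin labels (i, j) with s = d_i - d_j (mod n), or None if s is not covered."""
    for i, d_i in enumerate(D):
        for j, d_j in enumerate(D):
            if (d_i - d_j) % n == s % n:
                return i, j
    return None

def shift(values, D, s):
    """One clock tick: every chip writes to pin i, then every chip reads from pin j."""
    n = len(values)
    bus_of = build_shifter(D, n)
    i, j = pins_for_shift(D, n, s)
    busses = {}
    for c in range(n):
        b = bus_of[c][i]
        assert b not in busses, "collision: two chips wrote to one bus"
        busses[b] = values[c]
    return [busses[bus_of[c][j]] for c in range(n)]

values = list("ABCDEFGHIJKLM")               # chip c initially holds values[c]
shifted = shift(values, [0, 1, 3, 9], 8)
print(pins_for_shift([0, 1, 3, 9], 13, 8))   # (3, 1): write to pin 3, read from pin 1
print(shifted[8] == values[0])               # the value of chip 0 arrives at chip 8
```

The pin pair found for a shift by 8 agrees with the example given for Figure 1, and the internal assertion confirms that no bus is written twice in a tick.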

The remainder of this paper explores permutation architectures, the properties of
multiple-pin interconnections, and related combinatorial mathematics. In Section 2 we
define a permutation architecture, introduce the notion of uniformity, and prove some
basic properties of architectures that employ busses to realize arbitrary sets of permutations.
Section 3 defines the notion of a difference cover for a set of permutations, relates it to
the notion of a uniform permutation architecture, and proves some properties of difference
covers. In Section 4 we show how to build cyclic shifters that are provably efficient.
Section 5 investigates how to design small difference covers for any set of permutations that
forms  a  finite  group.  In  Section  6  we  extend  the  discussion  to  uniform  architectures  that 
realize  permutations  in  more  than  one  clock  tick.  We  present  a  variety  of  extensions  to  the 
results  of  the  paper  in  Section  7.  Finally,  in  Section  8  we  discuss  questions  left  open  by 
our  research.  An  appendix  of  standard  notations  and  definitions  is  included  for  reference. 
Notations  and  definitions  more  specific  to  the  content  of  the  paper  are  provided  in  context. 

2  Permutation  architectures 

In  this  section  we  formally  define  the  notion  of  a  permutation  architecture,  and  we  make 
precise  the  notion  of  uniformity.  We  also  prove  some  basic  properties  of  permutation 
architectures  that  realize  arbitrary  sets  of  permutations.  The  definitions  in  this  section  are 
somewhat  intricate  and  tedious,  and  are  indicative  of  the  difficulties  faced  in  the  design  of 
efficient  permutation  architectures.  In  the  next  section,  however,  we  use  these  definitions 
to show that reasoning about uniform permutation architectures is essentially equivalent
to  reasoning  about  difference  covers,  a  simpler  and  more  elegant  mathematical  notion.  The 
remainder  of  the  paper  then  uses  the  simpler  notion. 

For convenience, we adopt a few notational conventions. We use multiplicative notation
to denote composition of permutations. The inverse of a permutation $\pi$ is denoted by
$\pi^{-1}$. Composition of functions is performed in right-to-left order, so that $\pi_1 \pi_2$ is defined
by $\pi_1 \pi_2(i) = \pi_1(\pi_2(i))$. The identity permutation on $n$ elements is denoted by $I_n$, or
by $I$ if the number of elements is unimportant. For a permutation set $\Phi$ we denote
by $\Phi^{-1}$ the set of all the inverses of the permutations of $\Phi$, i.e. $\Phi^{-1} = \{\phi^{-1} : \phi \in \Phi\}$.
For two permutation sets $\Phi$ and $\Psi$, the notation $\Phi\Psi$ is used to denote the permutation
set $\{\phi\psi : \phi \in \Phi \text{ and } \psi \in \Psi\}$. We use the notation $[n]$ to denote the set of $n$ integers
$\{0, 1, \ldots, n - 1\}$.

We  first  define  the  notion  of  a  permutation  architecture. 

Definition 2 A permutation architecture is a 6-tuple $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ as
follows.

1. $C$ is a set of chips;

2. $B$ is a set of busses;

3. $P$ is a set of pins;

4. CHIP is a function $\mathrm{CHIP} : P \to C$;

5. BUS is a function $\mathrm{BUS} : P \to B$;

6. LABEL is a function $\mathrm{LABEL} : P \to \mathbb{N}$, where if $x, y \in P$, $x \ne y$, and $\mathrm{CHIP}(x) = \mathrm{CHIP}(y)$,
then $\mathrm{LABEL}(x) \ne \mathrm{LABEL}(y)$.

The  set  C  contains  all  the  chips  in  the  architecture,  and  the  set  B  contains  all  the  busses. 
Which  chips  are  connected  to  which  busses  is  determined  by  the  pins  they  have  in  common; 
the  set  P  contains  all  the  pins.  The  function  CHIP  determines  which  pins  belong  to  which 
chips.  Similarly,  the  function  BUS  determines  which  pins  are  interconnected  by  which  bus. 
The  function  LABEL  names  the  pins  on  the  chips  by  natural  numbers  such  that  all  pins  on 
a  given  chip  have  distinct  labels,  which  we  shall  sometimes  call  pin  numbers. 

Our formal definition of a permutation architecture omits several subsystems that
technically should be included, but whose inclusion is not germane to our study. These
subsystems include a control network that specifies what permutation is to be performed and
clocking circuitry for synchronization. Our focus is on the structure of the bussed
interconnections for permuting the data, and thus our definition encompasses only this aspect
of the architecture.

We  now  define  what  it  means  for  a  permutation  architecture  to  realize  a  permutation. 

Definition 3 A permutation architecture $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ realizes a
permutation $\pi : C \to C$ if there exist two functions $\mathrm{WRITE}_\pi : C \to P$ and $\mathrm{READ}_\pi : C \to P$,
such that for any chips $c, c_1, c_2 \in C$, we have:

1. $\mathrm{CHIP}(\mathrm{READ}_\pi(c)) = \mathrm{CHIP}(\mathrm{WRITE}_\pi(c)) = c$;

2. $\mathrm{BUS}(\mathrm{WRITE}_\pi(c)) = \mathrm{BUS}(\mathrm{READ}_\pi(\pi(c)))$;

3. $c_1 \ne c_2$ implies $\mathrm{BUS}(\mathrm{WRITE}_\pi(c_1)) \ne \mathrm{BUS}(\mathrm{WRITE}_\pi(c_2))$.

The architecture uniformly realizes $\pi$ if, in addition:

4. $\mathrm{LABEL}(\mathrm{WRITE}_\pi(c_1)) = \mathrm{LABEL}(\mathrm{WRITE}_\pi(c_2))$;

5. $\mathrm{LABEL}(\mathrm{READ}_\pi(c_1)) = \mathrm{LABEL}(\mathrm{READ}_\pi(c_2))$.

We say a permutation architecture realizes a set $\Pi$ of permutations if it realizes every
permutation in $\Pi$. We say it uniformly realizes $\Pi$ if it uniformly realizes every permutation
in $\Pi$.

Intuitively, for a permutation $\pi$, the functions $\mathrm{WRITE}_\pi$ and $\mathrm{READ}_\pi$ identify the write
pin and the read pin for each chip. Condition 1 makes sure that each chip writes and reads
pins that are connected to it. Condition 2 ensures that the bus to which chip $c$ writes is
read by chip $\pi(c)$. Condition 3 guarantees that no collisions occur, that is, no two data
transfers use the same bus. The architecture uniformly realizes a permutation (Conditions
4 and 5) if all chips write to pins with the same pin number and read from pins with the
same pin number, as in the cyclic shifter from Figure 1.
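
The five conditions translate almost line by line into a checker. The sketch below is one possible encoding, offered only as an illustration: it assumes the functions CHIP, BUS, and LABEL are given as dictionaries keyed by pin, that every chip owns at least one pin, and that the hypothetical function `realizes` receives candidate WRITE and READ assignments rather than searching for them.

```python
def realizes(chip, bus, label, pi, write, read, uniformly=False):
    """Check Definition 3 for one permutation.

    chip, bus, label: dicts mapping each pin to its chip, bus, and label.
    pi, write, read: dicts keyed by chip; write[c] and read[c] are pins.
    """
    chips = set(chip.values())
    # Condition 1: each chip writes to and reads from its own pins.
    cond1 = all(chip[write[c]] == c and chip[read[c]] == c for c in chips)
    # Condition 2: the bus written by chip c is the bus read by chip pi(c).
    cond2 = all(bus[write[c]] == bus[read[pi[c]]] for c in chips)
    # Condition 3: distinct chips write to distinct busses (no collisions).
    cond3 = len({bus[write[c]] for c in chips}) == len(chips)
    ok = cond1 and cond2 and cond3
    if uniformly:
        # Conditions 4 and 5: a single write label and a single read label.
        ok = ok and len({label[write[c]] for c in chips}) == 1
        ok = ok and len({label[read[c]] for c in chips}) == 1
    return ok

# Two chips, two busses; pin (c, i) has label i and sits on bus c + i (mod 2).
pins = [(c, i) for c in range(2) for i in range(2)]
chip = {p: p[0] for p in pins}
label = {p: p[1] for p in pins}
bus = {p: (p[0] + p[1]) % 2 for p in pins}
swap = {0: 1, 1: 0}
write = {c: (c, 1) for c in range(2)}
read = {c: (c, 0) for c in range(2)}
print(realizes(chip, bus, label, swap, write, read, uniformly=True))
```

In this toy instance both chips write to their pin labeled 1 and read from their pin labeled 0, so the swap is realized uniformly.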

Our definition of a permutation architecture implies that “complete” permutations are
to  be  realized,  that  is,  every  chip  sends  exactly  one  datum  and  receives  exactly  one  datum. 
Moreover,  an  interconnection  is  required  even  when  a  chip  sends  a  datum  to  itself.  Since 
no  collisions  occur,  the  number  of  busses  in  the  architecture  must  be  at  least  the  number 
of  chips.  This  observation  leads  directly  to  the  following  theorem. 

Theorem 1 In any permutation architecture that realizes some nonempty permutation set
$\Pi$, the average number of pins per bus is at most the average number of pins per chip.

Proof. Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a permutation architecture for $\Pi$. The
average number of pins per chip is $|P| / |C|$, and the average number of pins per bus is
$|P| / |B|$. Condition 3 of Definition 3 says that for any permutation $\pi \in \Pi$, any two distinct
chips are mapped to distinct busses. Consequently, we get that $|B| \ge |C|$, which proves
the theorem. ∎

Under the assumption that no interconnection is needed for a chip to send data to
itself, Theorem 1 is no longer applicable. A similar theorem can be proved for this model,
however, which involves the number of fixed points in the permutations realized by the
architecture. Specifically, suppose the architecture realizes a set $\Pi$ of permutations. Define
the rank of a permutation $\pi \in \Pi$ as $\mathrm{RANK}(\pi) = |\{c \in C : \pi(c) \ne c\}|$, and define the rank
of the permutation set $\Pi$ as $\mathrm{RANK}(\Pi) = \max_{\pi \in \Pi} \mathrm{RANK}(\pi)$. The analogue to Theorem 1
states that the ratio between the average number of pins per bus and the average number
of pins per chip is at most $|C| / \mathrm{RANK}(\Pi)$.
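
As a small illustration of the rank just defined, the sketch below represents a permutation as a tuple whose entry at index $c$ is $\pi(c)$; this representation and the names `rank` and `rank_of_set` are assumptions made for the example.

```python
def rank(pi):
    """Number of chips that pi does not fix; pi[c] is the image of chip c."""
    return sum(1 for c, image in enumerate(pi) if image != c)

def rank_of_set(perms):
    """Largest rank among the permutations in the set."""
    return max(rank(pi) for pi in perms)

shifts = [tuple((c + s) % 5 for c in range(5)) for s in range(5)]
print(rank(shifts[0]))       # the identity moves no chip
print(rank_of_set(shifts))   # every nonzero cyclic shift moves all 5 chips
```

For the cyclic shifts the rank of the set equals the number of chips, so the bound $|C| / \mathrm{RANK}(\Pi)$ is 1 and matches the bound of Theorem 1.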

In any architecture $A$ that uniformly realizes a permutation set $\Pi$, the number of pins
that are actually used to uniformly realize $\Pi$ is the same for all chips, and additional pins
on a chip are unused. Furthermore, the number of busses used in realizing any permutation
$\pi \in \Pi$ is equal to the number of chips. These observations lead to the following definition
of a uniform architecture.

Definition 4 A uniform permutation architecture for a permutation set $\Pi$ is a permutation
architecture $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ such that:

1. $A$ uniformly realizes $\Pi$;

2. $|\{x \in P : \mathrm{CHIP}(x) = c_1\}| = |\{x \in P : \mathrm{CHIP}(x) = c_2\}|$ for any two chips $c_1, c_2 \in C$;

3. $|B| = |C|$;

4. if $x \ne y$ and $\mathrm{LABEL}(x) = \mathrm{LABEL}(y)$, then $\mathrm{BUS}(x) \ne \mathrm{BUS}(y)$.

Thus,  all  the  chips  in  a  uniform  permutation  architecture  have  the  same  number  of  pins 
(Condition  2),  the  number  of  busses  is  equal  to  the  number  of  chips  (Condition  3),  and 
the  labels  of  the  pins  on  any  bus  are  distinct  (Condition  4). 

The following theorem demonstrates that any permutation architecture that uniformly
realizes some permutation set $\Pi$ can be made into a uniform architecture.

Theorem 2 Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a permutation architecture that
uniformly realizes the permutation set $\Pi$, and let $k$ be the smallest number of pins on any
chip in $C$. Then there is a uniform architecture $A' = (C', B', P', \mathrm{CHIP}', \mathrm{BUS}', \mathrm{LABEL}')$ for
$\Pi$ with at most $k$ pins per chip.


Proof. We construct the uniform architecture $A'$ from the permutation architecture
$A$ in two steps. First, we construct an intermediate permutation architecture
$A'' = (C'', B'', P'', \mathrm{CHIP}'', \mathrm{BUS}'', \mathrm{LABEL}'')$ by removing extraneous pins from chips in $A$
such that all chips end up with the same number of pins per chip and such that each pin
plays a role in uniformly realizing $\Pi$. Then, the busses of $A''$ are reorganized to produce
the architecture $A'$ in such a way that the number of busses in $A'$ is equal to the number
of chips. We assume that the permutation set $\Pi$ is nonempty, since otherwise the theorem
is trivial.

In the first step, we remove pins that are unused in uniformly realizing $\Pi$. Since $A$
uniformly realizes $\Pi$, each permutation $\pi \in \Pi$ can be associated with a distinct pair $(i, j)$
of pin labels corresponding to the labels that all chips write to and read from in order to
realize $\pi$. A pin is unused if its label does not appear in any of these $|\Pi|$ pairs. Removing
the unused pins results in the architecture $A''$ in which all chips have the same number
of pins, since each chip has exactly one pin for each label used in uniformly realizing $\Pi$.
The permutation architecture $A''$ uniformly realizes $\Pi$, and furthermore, each pin is used
in uniformly realizing some $\pi \in \Pi$. If we let $s$ denote the number of pins per chip in $A''$,
then we have $s \le k$, since originally at least one chip had $k$ pins and no pins were added.

In the second step, we reorganize the busses of $A''$ to produce the uniform architecture
$A'$ in which the number of busses is equal to the number of chips. For any permutation
architecture that realizes a nonempty permutation set, the number of busses is never
smaller than the number of chips. Assume without loss of generality that $C'' = [n]$,
$B'' = [m]$, and $\mathrm{range}(\mathrm{LABEL}'') = [s]$. The theorem is proved if the architecture $A''$ uses
only $n = |C''|$ busses, but in general, the architecture might use $m > n$ busses.

We define a collection of mappings $\Psi = \{\psi_0, \psi_1, \ldots, \psi_{s-1}\}$, where for each $0 \le i \le s - 1$,
the mapping $\psi_i : [n] \to [m]$ is defined to be $\psi_i(c) = b$ if and only if chip $c \in C''$ is connected
via its pin number $i$ to bus $b \in B''$. The elements of $\Psi$ are indeed mappings since each
chip has a pin numbered $i$ for each $0 \le i \le s - 1$. The mappings are injective (one-to-one),
since otherwise two pins with the same pin number would be connected to the same bus,
and both pins could not be used to uniformly realize permutations, thereby violating the
construction of $A''$ in the first step. The collection $\Psi$ is a multiset, since it may be that
two different pin numbers $i \ne j$ define the same mapping (i.e. $\psi_i = \psi_j$). The key idea is
that any permutation is implemented by each chip writing to pin $i$ and reading from pin
$j$, thereby employing the mapping $\psi_i$ to write data from the $n$ chips to $n$ distinct busses, and
the inverse of the mapping $\psi_j$ to read data from the same $n$ busses back to the $n$ chips.

We now show how to reorganize the busses of $A''$ in order to construct a uniform
architecture $A'$. We partition $\Psi$ into $l$ equivalence classes $\Psi_0 \cup \Psi_1 \cup \cdots \cup \Psi_{l-1}$ such that
$\psi_i$ and $\psi_j$ are in the same equivalence class if and only if $\mathrm{range}(\psi_i) = \mathrm{range}(\psi_j)$. This
partitioning has the property that if $\pi \in \Pi$, then there exists an $r$ such that $\pi = \psi_j^{-1} \psi_i$,
where $\psi_i, \psi_j \in \Psi_r$. (Recall that the inverse of an injective mapping $\psi : [n] \to [m]$ is defined
as the mapping $\psi^{-1} : \mathrm{range}(\psi) \to [n]$ such that if $\psi(c) = b$, then $\psi^{-1}(b) = c$.) For each
$0 \le r \le l - 1$, pick a bijection (one-to-one, onto) $f_r : \mathrm{range}(\psi) \to [n]$, where $\psi$ is any mapping in
$\Psi_r$. (We can pick a bijection, since $\psi$ is injective, which implies $|\mathrm{range}(\psi)| = n$.) We define
the architecture $A'$ by $C' = C''$, $B' = [n]$, $P' = P''$, $\mathrm{CHIP}' = \mathrm{CHIP}''$, $\mathrm{LABEL}' = \mathrm{LABEL}''$, and
for any pin $x \in P'$ such that $\psi_{\mathrm{LABEL}'(x)} \in \Psi_r$, we define $\mathrm{BUS}'(x) = f_r(\mathrm{BUS}''(x))$.

The architecture $A'$ has exactly $s$ pins per chip and satisfies $|B'| = |C'| = n$, thereby
satisfying Conditions 2 and 3 of Definition 4. We show Condition 4 holds by considering
any two pins $x$ and $y$ with $\mathrm{LABEL}'(x) = \mathrm{LABEL}'(y) = i$. We have $\mathrm{BUS}'(x) = f_r(\mathrm{BUS}''(x))$
and $\mathrm{BUS}'(y) = f_r(\mathrm{BUS}''(y))$ for some $f_r$ as defined in the previous paragraph. Since $f_r$ is
an injective mapping and because Condition 4 of Definition 4 holds for $A''$, we then have
$x \ne y$ implies $\mathrm{BUS}'(x) \ne \mathrm{BUS}'(y)$.

It remains to show that Condition 1 of Definition 4 holds, that is, that $A'$ uniformly
realizes $\Pi$. Consider any permutation $\pi \in \Pi$. Since $A''$ uniformly realizes $\Pi$, there exists a
pair of pin labels $(i, j)$ such that $\pi$ is realized in $A''$ by each chip writing to its pin numbered
$i$ and reading from its pin numbered $j$. We use the same pin labels $(i, j)$ to realize the
permutation $\pi$ in $A'$. Conditions 1, 4, and 5 of Definition 3 are immediately satisfied. To
verify Conditions 2 and 3 we use the following observation. In architecture $A''$ chip $c$ is
connected via its pin labeled $h$ to bus $\psi_h(c)$, while in architecture $A'$ it is connected to
bus $f_r(\psi_h(c))$, where $\psi_h \in \Psi_r$. Condition 2 now holds since $\pi = \psi_j^{-1} \psi_i = (f_r \psi_j)^{-1} (f_r \psi_i)$.
Condition 3 holds since $f_r \psi_i$ is a permutation on $[n]$. We therefore conclude that $A'$ is a
uniform architecture for $\Pi$ with at most $k$ pins per chip. ∎

The next theorem provides a lower bound on the number of pins per chip in any uniform
architecture for a permutation set $\Pi$. (A related theorem due to C. Fiduccia appears in
[20, p. 308].)

Theorem 3 Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a uniform permutation architecture
for a permutation set $\Pi$. Then the number of pins per chip in $A$ is at least $\sqrt{|\Pi|}$.

Proof. Because architecture $A$ realizes $\Pi$ uniformly, we can associate each $\pi \in \Pi$ with a
pair $(i, j)$ of pin numbers such that $\pi$ is realized by each chip writing to its pin labeled
$i$ and reading from its pin labeled $j$. Since $A$ is uniform, each chip has exactly $|P| / |C|$
pins, and the number of such pairs is $(|P| / |C|)^2$. No two permutations can be associated
with the same pair, and thus, we have $(|P| / |C|)^2 \ge |\Pi|$ or $|P| / |C| \ge \sqrt{|\Pi|}$. ∎
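
Because the number of pins per chip is an integer, the bound of Theorem 3 amounts to the ceiling of $\sqrt{|\Pi|}$. The sketch below computes it with integer arithmetic; the function name `min_pins_per_chip` is hypothetical.

```python
from math import isqrt

def min_pins_per_chip(num_permutations):
    """Smallest integer k with k * k >= num_permutations."""
    k = isqrt(num_permutations)
    return k if k * k == num_permutations else k + 1

print(min_pins_per_chip(13))   # 4: the 13-chip shifter of Figure 1 meets the bound
print(min_pins_per_chip(16))   # 4
```

For the 13 cyclic shifts of Figure 1 the bound is 4 pins per chip, so that architecture is optimal among uniform architectures.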

A  permutation  architecture  can  often  nonuniformly  realize  many  more  permutations 
than  the  square  of  the  number  of  pins  per  chip.  As  an  example,  consider  a  “crossbar” 
architecture of $n$ chips and $n$ busses where each chip is connected to each bus. This
architecture can nonuniformly realize all $n!$ permutations, which is much greater than $n^2$,
the square of the number of pins per chip. In Section 7 we discuss some of the capabilities
of  nonuniform  permutation  architectures. 




3  Difference  covers


In  this  section,  we  present  our  main  theorems  which  establish  the  relationship  between 
difference covers for permutation sets and uniform permutation architectures. We also
prove  some  lemmas  concerning  difference  covers  for  Cartesian  products  of  permutation 
sets.  Finally,  we  present  an  alternative  representation  for  difference  covers  called  substring 
covers  based  on  similar  notions  in  the  literature  of  difference  sets. 

We  first  provide  a  generalization  of  Definition  1  to  arbitrary  sets  of  permutations. 

Definition 5 A difference cover for a permutation set $\Pi$ is a set $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$
of permutations such that for each $\pi \in \Pi$ there exist $\phi_i, \phi_j \in \Phi$ such that $\pi = \phi_j^{-1} \phi_i$.

Equivalently, we can use our product-of-sets notation to say that $\Phi$ is a difference cover
for $\Pi$ if $\Phi^{-1} \Phi \supseteq \Pi$.

The following two theorems show how difference covers and uniform architectures are
related. Theorem 4 describes how to design a uniform architecture for a permutation set
$\Pi$ when a difference cover for $\Pi$ is given. Theorem 5 presents a construction of a difference
cover for a permutation set $\Pi$ from a uniform architecture for $\Pi$.

Theorem 4 Let $\Pi$ be a permutation set, and let $\Phi$ be a difference cover for $\Pi$ such that
$|\Phi| = k$. Then there exists a uniform architecture for $\Pi$ with $k$ pins per chip.

Proof. Let $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$, and assume that $\Pi$ is a set of permutations on $n$
objects. We construct a permutation architecture for $\Pi$ with $n$ busses and $k$ pins per
chip. We name the chips and busses of the architecture by natural numbers, and the pins
by pairs of natural numbers. The architecture $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ is defined
as $C = [n]$, $B = [n]$, $P = [n] \times [k]$, $\mathrm{CHIP}(c, i) = c$, $\mathrm{LABEL}(c, i) = i$, and
$\mathrm{BUS}(c, i) = \phi_{\mathrm{LABEL}(c, i)}(\mathrm{CHIP}(c, i)) = \phi_i(c)$. That is, chip $c$ is connected via its pin number $i$ to bus
$\phi_i(c)$.

To see formally that this architecture uniformly realizes $\Pi$, let $\pi \in \Pi$ be a permutation,
and let $\phi_i, \phi_j \in \Phi$ be elements of the difference cover for $\Pi$ such that $\pi = \phi_j^{-1} \phi_i$. Define the
write function for $\pi$ as $\mathrm{WRITE}_\pi(c) = (c, i)$ and define the read function for $\pi$ as
$\mathrm{READ}_\pi(c) = (c, j)$. (Note that $i$ and $j$ are always in the range 0 through $k - 1$.) We now verify that
the five Conditions of Definition 3 are satisfied. Condition 1 holds since for any chip
$c \in C$ we have $\mathrm{CHIP}(\mathrm{WRITE}_\pi(c)) = \mathrm{CHIP}(c, i) = c$, and $\mathrm{CHIP}(\mathrm{READ}_\pi(c)) = \mathrm{CHIP}(c, j) = c$.
Condition 2 is satisfied since for any chip $c \in C$ we have

$$
\begin{aligned}
\mathrm{BUS}(\mathrm{WRITE}_\pi(c)) &= \mathrm{BUS}(c, i) \\
&= \phi_i(c) \\
&= \phi_j(\pi(c)) \\
&= \mathrm{BUS}(\pi(c), j) \\
&= \mathrm{BUS}(\mathrm{READ}_\pi(\pi(c))).
\end{aligned}
$$

Condition 3 holds because if $\mathrm{BUS}(\mathrm{WRITE}_\pi(c_1)) = \mathrm{BUS}(\mathrm{WRITE}_\pi(c_2))$ for any two chips
$c_1, c_2 \in C$, then we have $\phi_i(c_1) = \phi_i(c_2)$, which implies that $c_1 = c_2$, since $\phi_i$ is invertible.
Conditions 4 and 5 both hold since $\mathrm{LABEL}(\mathrm{WRITE}_\pi(c)) = i$ and $\mathrm{LABEL}(\mathrm{READ}_\pi(c)) = j$ for
all chips $c \in C$. We therefore conclude that the architecture $A$ uniformly realizes $\Pi$. The
architecture is uniform, but Theorem 2 obviates the need to show this fact. ∎
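
The construction in this proof can be sketched in a few lines. The code below is an illustration under stated assumptions: permutations are tuples with `p[c]` the image of `c`, the wiring is stored as a dictionary from pins `(c, i)` to busses, and the names `compose`, `inverse`, `architecture_from_cover`, and `find_pins` are hypothetical.

```python
def compose(p, q):
    """Right-to-left composition: (p q)(c) = p(q(c))."""
    return tuple(p[q[c]] for c in range(len(q)))

def inverse(p):
    """Inverse permutation of p."""
    inv = [0] * len(p)
    for c, image in enumerate(p):
        inv[image] = c
    return tuple(inv)

def architecture_from_cover(cover):
    """Wiring of the proof: pin (c, i) of chip c goes to bus cover[i][c]."""
    n = len(cover[0])
    return {(c, i): phi[c] for i, phi in enumerate(cover) for c in range(n)}

def find_pins(cover, pi):
    """Return labels (i, j) with pi = cover[j]^{-1} cover[i], or None."""
    for i, phi_i in enumerate(cover):
        for j, phi_j in enumerate(cover):
            if compose(inverse(phi_j), phi_i) == pi:
                return i, j
    return None

n, D = 13, [0, 1, 3, 9]
cover = [tuple((c + d) % n for c in range(n)) for d in D]   # phi_i(c) = c + d_i
bus = architecture_from_cover(cover)
pi = tuple((c + 8) % n for c in range(n))                    # cyclic shift by 8
i, j = find_pins(cover, pi)
# Condition 2: the bus written by chip c is the one read by chip pi(c).
print(all(bus[(c, i)] == bus[(pi[c], j)] for c in range(n)))
# Condition 3: the n chips write to n distinct busses.
print(len({bus[(c, i)] for c in range(n)}) == n)
```

Instantiating the cover with the translations $\phi_i(c) = c + d_i$ recovers the cyclic shifter of Section 1 as a special case of the general construction.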

Given  a  difference  cover  of  small  cardinality,  Theorem  4  says  we  can  construct  a  uniform 
architecture  with  few  pins  per  chip.  In  fact,  the  reverse  is  true  as  well,  as  the  following 
theorem  shows. 

Theorem 5 Let $\Pi$ be a permutation set, and let $A$ be a uniform architecture for $\Pi$ with
$k$ pins per chip. Then $\Pi$ has a difference cover $\Phi$ such that $|\Phi| \le k$.

Proof. Given a uniform architecture $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ for the permutation
set $\Pi$, where $k$ is the number of pins on each chip, we construct a difference cover $\Phi$ for $\Pi$
as follows. Assume without loss of generality that $C = B = [n]$ and $\mathrm{range}(\mathrm{LABEL}) = [k]$.
For each pin number $i$, where $i = 0, \ldots, k - 1$, we define $\phi_i$ by $\phi_i(c) = b$ if and only if
chip $c$ is connected via its pin number $i$ to bus $b$. We now define the difference cover $\Phi$ to
be the set $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$. (The set $\Phi$ may have fewer than $k$ elements, since some
permutations may be repeated among the $\phi_i$'s.)

To see that $\Phi$ is a difference cover for $\Pi$, consider any permutation $\pi \in \Pi$. Since $A$
uniformly realizes $\pi$, there exists a pair of pin labels $(i, j)$ such that $\pi$ is realized by each
chip writing to its pin numbered $i$ and reading from its pin numbered $j$. The labels $i$ and $j$
satisfy $i = \mathrm{LABEL}(\mathrm{WRITE}_\pi(c))$ and $j = \mathrm{LABEL}(\mathrm{READ}_\pi(c))$ for all chips $c \in C$, as follows
from Conditions 4 and 5 of Definition 3. Conditions 1 and 3 of Definition 3 imply that
$\phi_i$ and $\phi_j$ are both permutations, and therefore there are $\phi_h, \phi_l \in \Phi$ such that $\phi_h = \phi_i$
and $\phi_l = \phi_j$. Finally, Condition 2 of Definition 3 implies that $\pi = \phi_j^{-1} \phi_i = \phi_l^{-1} \phi_h$, which
proves that $\Phi$ is indeed a difference cover for $\Pi$. ∎
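
The reverse direction can be sketched the same way. The example below is illustrative only: it assumes chips and busses are numbered 0 through $n - 1$, the wiring is a dictionary from pins `(c, i)` to busses, and the names `cover_from_architecture` and `covers` are hypothetical. One pin is deliberately wired like another to show that the resulting cover can have fewer than $k$ elements.

```python
def cover_from_architecture(bus, n, k):
    """Collect phi_i(c) = bus reached from chip c through pin i; repeated maps collapse."""
    return {tuple(bus[(c, i)] for c in range(n)) for i in range(k)}

def covers(cover, perms):
    """True if every pi in perms equals phi_j^{-1} phi_i for some phi_i, phi_j in cover."""
    def quotient(phi_j, phi_i):
        inv = {b: c for c, b in enumerate(phi_j)}            # phi_j^{-1}
        return tuple(inv[phi_i[c]] for c in range(len(phi_i)))
    reachable = {quotient(phi_j, phi_i) for phi_j in cover for phi_i in cover}
    return all(pi in reachable for pi in perms)

n, offsets = 7, [0, 1, 3, 1]     # pin 3 repeats the wiring of pin 1
bus = {(c, i): (c + d) % n for c in range(n) for i, d in enumerate(offsets)}
cover = cover_from_architecture(bus, n, k=len(offsets))
shifts = [tuple((c + s) % n for c in range(n)) for s in range(n)]
print(len(cover))              # 3 distinct mappings from 4 pins
print(covers(cover, shifts))   # the 7 cyclic shifts are all covered
```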

Theorems  4  and  5  show  that  uniform  architectures  and  difference  covers  are  very  closely 
related. Thus, when designing a uniform permutation architecture for a set of permutations,
it suffices to focus on the problem of constructing a good difference cover for that
set. 

The  structure  of  a  permutation  set  can  be  helpful  in  obtaining  a  difference  cover  for  it. 
In  Sections  4  and  5,  we  investigate  the  construction  of  difference  covers  for  cyclic  groups 
of permutations and for groups in general. Here, we examine permutation sets formed by
Cartesian  products. 

Definition 6 Let $\Pi_1$ be a set of permutations from $X_1$ to $X_1$, and let $\Pi_2$ be a set of
permutations from $X_2$ to $X_2$. The Cartesian product $\Pi = \Pi_1 \times \Pi_2$ is the set of permutations
from $X_1 \times X_2$ to $X_1 \times X_2$ defined as $\Pi = \{(\pi_1, \pi_2) : \pi_1 \in \Pi_1, \pi_2 \in \Pi_2\}$. Operations on the
elements of $\Pi$ are performed componentwise.

The Cartesian product $\Pi_1 \times \Pi_2$ is isomorphic to the Cartesian product $\Pi_2 \times \Pi_1$. The
Cartesian product $\Pi = \Pi_1 \times \Pi_2$ is an abelian permutation set if and only if both $\Pi_1$ and
$\Pi_2$ are abelian permutation sets.

The next two lemmas provide bounds on the size of difference covers for Cartesian
products of permutation sets. (Similar lemmas hold for composition products of permutation
sets.)

Lemma 6 Let $\Pi_1$ be a permutation set on $n_1$ objects, and let $\Pi_2$ be a permutation set on
$n_2$ objects. Then the Cartesian product $\Pi = \Pi_1 \times \Pi_2$, which is a permutation set on $n_1 \cdot n_2$
objects, has a difference cover of size $|\Pi_1| + |\Pi_2|$.

Proof. Let $\Phi$ be the union of $\{(\pi_1^{-1}, I_{n_2}) : \pi_1 \in \Pi_1\}$ and $\{(I_{n_1}, \pi_2) : \pi_2 \in \Pi_2\}$. Each
permutation $\pi = (\pi_1, \pi_2) \in \Pi$ can be represented as $(\pi_1, \pi_2) = (\pi_1^{-1}, I_{n_2})^{-1} \cdot (I_{n_1}, \pi_2)$, where
both $(\pi_1^{-1}, I_{n_2})$ and $(I_{n_1}, \pi_2)$ are in $\Phi$. Thus $\Phi$ is a difference cover for $\Pi$, and the size of $\Phi$
is exactly $|\Pi_1| + |\Pi_2|$. ∎

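Lemma 7 Let $\Pi_1$ be a permutation set on $n_1$ objects with a difference cover $\Phi_1$, and let $\Pi_2$
be a permutation set on $n_2$ objects with a difference cover $\Phi_2$. Then the Cartesian product
$\Phi = \Phi_1 \times \Phi_2$ is a difference cover for $\Pi = \Pi_1 \times \Pi_2$.

Proof. For each $\pi = (\pi_1, \pi_2) \in \Pi$, there exist $\phi_{i_1}, \phi_{j_1} \in \Phi_1$ such that $\pi_1 = \phi_{j_1}^{-1} \phi_{i_1}$, and
there exist $\phi_{i_2}, \phi_{j_2} \in \Phi_2$ such that $\pi_2 = \phi_{j_2}^{-1} \phi_{i_2}$. We then have
$(\pi_1, \pi_2) = (\phi_{j_1}^{-1} \phi_{i_1}, \phi_{j_2}^{-1} \phi_{i_2}) = (\phi_{j_1}, \phi_{j_2})^{-1} (\phi_{i_1}, \phi_{i_2})$,
where both $(\phi_{i_1}, \phi_{i_2})$ and $(\phi_{j_1}, \phi_{j_2})$ are in $\Phi = \Phi_1 \times \Phi_2$, and hence $\Phi$
is a difference cover for $\Pi$. ∎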

To demonstrate both the use of difference covers and Lemma 7, we present in Figure 2 a uniform permutation architecture due to C. Fiduccia [10] for realizing shifts in a two-dimensional array. The architecture uniformly realizes the permutation set $\Pi = \{I, N, E, S, W, NE, SE, NW, SW\}$ of eight compass directions plus the identity $I$. We introduce two permutation sets $\Pi_1 = \{I, N, S\}$, $\Pi_2 = \{I, E, W\}$, and corresponding difference covers $\Phi_1 = \{I, S\}$ and $\Phi_2 = \{I, E\}$. The Cartesian product $\Pi_1 \times \Pi_2$ is $\Pi$, and the set of permutations $\Phi = \Phi_1 \times \Phi_2 = \{S, SE, E, I\}$ is a difference cover for $\Pi$.
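The compass example can be checked mechanically. The following is a minimal sketch, assuming each shift is modelled as a $(dx, dy)$ offset on an array large enough that distinct offsets are distinct permutations; the coordinate convention and the names `differences`, `phi1`, `phi2` are choices of this sketch, not part of the original construction.

```python
from itertools import product

# Shifts are modelled as (dx, dy) offsets and composed by vector addition.
# The convention N = +y, E = +x is an assumption of this sketch.
I, N, S, E, W = (0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)


def differences(cover):
    """Return every phi_i^{-1} phi_j; for shifts this is phi_j - phi_i."""
    return {(b[0] - a[0], b[1] - a[1]) for a in cover for b in cover}


phi1 = [I, S]  # difference cover for {I, N, S}
phi2 = [I, E]  # difference cover for {I, E, W}

# Cartesian product: combine one vertical and one horizontal shift.
phi = [(h[0] + v[0], h[1] + v[1]) for v, h in product(phi1, phi2)]

# The identity plus the eight compass directions.
compass = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
covered = differences(phi)
print(compass <= covered)  # True when phi is a difference cover
```

The four elements of `phi` are the offsets of $I$, $E$, $S$, and $SE$, matching the cover used in Figure 2.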

We conclude this section by defining the notion of a substring cover for a permutation set $\Pi$, which is equivalent to the notion of a difference cover. (A similar notion for difference sets is well known in the literature [6, 26].)

Definition 7 An ordered list $\Sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{k-1})$ of permutations is a substring cover for a permutation set $\Pi$ if

1. $\sigma_0 \sigma_1 \cdots \sigma_{k-1} = I$, and

2. for all $\pi \in \Pi$, there exist $0 \le i, j \le k - 1$ such that $\pi = \sigma_i \sigma_{i+1} \cdots \sigma_j$, where the arithmetic in the indices is performed modulo $k$.

Figure 2: A uniform architecture due to C. Fiduccia [10] based on the difference cover $\{S, SE, E, I\}$ for the permutation set $\Pi = \{I, N, E, S, W, NE, SE, NW, SW\}$.

The substring cover $\Sigma$ is a list of permutations such that all the permutations in $\Pi$ can be represented as a composition of a substring of permutations of $\Sigma$. The following two theorems show that the notions of a substring cover and difference cover are equivalent.

Theorem 8 Let $\Pi$ be a permutation set on $n$ elements, and let $\Sigma$ be a $k$-element substring cover for $\Pi$. Then $\Pi$ has a difference cover $\Phi$ with at most $k$ elements.

Proof. Given a $k$-element substring cover $\Sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{k-1})$ for $\Pi$, a difference cover $\Phi$ with at most $k$ elements can be constructed. For each $0 \le i \le k - 1$ we define $\phi_i = \sigma_0 \sigma_1 \cdots \sigma_i$. If a permutation $\pi$ can be represented as $\pi = \sigma_i \sigma_{i+1} \cdots \sigma_j$, then $\pi = \phi_{i-1}^{-1} \phi_j$, with indices taken modulo $k$. By construction, the difference cover $\Phi$ has at most $k$ elements. ∎

Theorem 9 Let $\Pi$ be a permutation set on $n$ elements, and let $\Phi$ be a $k$-element difference cover for $\Pi$. Then $\Pi$ has a substring cover $\Sigma$ with $k$ elements.

Proof. Given a $k$-element difference cover $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$ for $\Pi$, we build a substring cover $\Sigma$ for $\Pi$ by defining $\sigma_i = \phi_{i-1}^{-1} \phi_i$ for all $0 \le i \le k - 1$, with indices taken modulo $k$. The product $\sigma_0 \sigma_1 \cdots \sigma_{k-1}$ yields the identity permutation. For each $\pi \in \Pi$, if $\pi = \phi_i^{-1} \phi_j$ then $\pi = \sigma_{i+1} \sigma_{i+2} \cdots \sigma_j$. Therefore $\Sigma$ is a substring cover for $\Pi$ with $k$ elements. ∎

Referring back to the example of the eight compass directions, we present a substring cover for the permutation set $\Pi = \{I, N, E, S, W, NE, SE, NW, SW\}$. The substring cover $\Sigma = (S, E, N, W)$ is constructed from the difference cover $\Phi = \{S, SE, E, I\}$ that was used in the architecture of Figure 2. Each of the eight compass directions can be realized as a substring of the list $\Sigma = (S, E, N, W)$.


Figure 3: A uniform architecture due to C. Feynman [15] based on the difference cover $\{N, E, I\}$ for the permutation set $\Pi = \{I, N, E, S, W\}$.

As another example, consider the permutation set $\Pi = \{I, N, E, S, W\}$ of the shifts in a 2-dimensional array corresponding to the four compass directions. This permutation set has a difference cover $\Phi = \{N, E, I\}$ and a corresponding substring cover $\Sigma = (N, SE, W)$. Consequently, there is a uniform architecture for realizing the four compass directions with three pins per chip, as has been observed by C. Feynman [15, pp. 437-438]. Figure 3 presents a uniform architecture based on the difference cover $\Phi = \{N, E, I\}$ for the permutation set $\Pi = \{I, N, E, S, W\}$.
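The passage from a difference cover to a substring cover can be sketched in a few lines. This is a minimal illustration, assuming shifts are modelled as $(dx, dy)$ offsets composed by addition, and assuming the indexing $\sigma_i = \phi_{i-1}^{-1}\phi_i$ (indices modulo $k$), which reproduces the two lists quoted in the text; the helper names are hypothetical.

```python
I, N, S, E, W = (0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)
SE = (1, -1)


def add(a, b):
    return (a[0] + b[0], a[1] + b[1])


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])


def substring_cover(phi):
    """sigma_i = phi_{i-1}^{-1} phi_i; phi[-1] wraps to the last element."""
    return [sub(phi[i], phi[i - 1]) for i in range(len(phi))]


def substring_products(sigma):
    """Compositions of all cyclic substrings sigma_i sigma_{i+1} ... sigma_j."""
    k = len(sigma)
    out = set()
    for i in range(k):
        total = (0, 0)
        for length in range(k):
            total = add(total, sigma[(i + length) % k])
            out.add(total)
    return out


sigma8 = substring_cover([S, SE, E, I])  # cover for the eight compass directions
sigma4 = substring_cover([N, E, I])      # cover for the four compass directions
print(sigma8)  # offsets of S, E, N, W
print(sigma4)  # offsets of N, SE, W
```

Every substring product of `sigma8` is one of the nine compass shifts, and the products of `sigma4` include $I$, $N$, $E$, $S$, and $W$.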

4  Cyclic  shifters 

This section describes uniform architectures for realizing cyclic shifts among $n$ chips in one clock tick. We first present a difference cover of size $O(\sqrt{n})$ for the set of all $n$ cyclic shifts on $n$ elements, and we give an area-efficient layout for the corresponding permutation architecture suitable for implementation as a printed-circuit board. When $n$ can be expressed as $n = q^2 + q + 1$, where $q$ is a power of a prime, we improve the bound on the size of a difference cover for all cyclic shifts on $n$ elements to the optimal value of $\lceil\sqrt{n}\rceil$. Finally, we prove that for any cyclic shifter that operates in one clock tick (even a nonuniform one), the average number of pins per chip is at least $\lceil\sqrt{n}\rceil$.

The  first  permutation  architecture  for  cyclic  shifters  that  we  present  is  based  on  the 
construction  in  the  following  simple  theorem. 

Theorem 10 The set of $n$ cyclic shifts on $n$ elements has a difference cover of size at most $2\lceil\sqrt{n}\rceil - 1$.

Proof. Since the set of $n$ cyclic shifts on $n$ elements forms a group, and since this group is isomorphic to the group $Z_n$, we shall construct a difference cover $D$ for $Z_n$. For convenience, let $m = \lceil\sqrt{n}\rceil$. Define two sets $A = \{0, 1, \ldots, m - 1\}$ and $B = \{0, m, 2m, \ldots, (m - 1)m\}$, and let the difference cover $D$ be defined by $D = A \cup B$. Each element $s \in Z_n$ can be realized as $s = b - a \pmod{n}$, where $a \in A$ and $b \in B$: when $s \le (m - 1)m$, take $a = (-s) \bmod m$ and $b = \lceil s/m \rceil \cdot m$; otherwise $n - s \le m - 1$, and we take $a = n - s$ and $b = 0$. The size of the difference cover $D$ is at most $2m - 1 = 2\lceil\sqrt{n}\rceil - 1$, since the element 0 occurs in both $A$ and $B$. ∎
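The construction in the proof is easy to check exhaustively for small $n$. The sketch below builds $D = A \cup B$ with $m = \lceil\sqrt{n}\rceil$ and verifies that the pairwise differences modulo $n$ cover $Z_n$; the function names are illustrative only.

```python
from math import isqrt


def cyclic_difference_cover(n):
    """Return (D, m) with D = A | B reduced mod n and m = ceil(sqrt(n))."""
    m = isqrt(n - 1) + 1  # ceil(sqrt(n)) for n >= 1
    A = {a % n for a in range(m)}
    B = {(j * m) % n for j in range(m)}
    return sorted(A | B), m


def covers_all_shifts(n, D):
    """True if every s in Z_n equals d_j - d_i (mod n) for some d_i, d_j in D."""
    return {(b - a) % n for a in D for b in D} == set(range(n))


D, m = cyclic_difference_cover(16)
print(D, m)  # [0, 1, 2, 3, 4, 8, 12] 4
print(covers_all_shifts(16, D))  # True
```

For $n = 16$ the cover has seven elements, which agrees with the seven pins per chip in the layout of Figure 4.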

The difference cover constructed in the proof of Theorem 10 corresponds to an architecture with a regular, area-efficient layout, as shown in Figure 4. The $n$ chips of the architecture are laid out in an array consisting of $m = \sqrt{n}$ rows, each containing $\sqrt{n}$ chips. (For simplicity, we assume that $n$ is a square.) Each chip has pins $0, 1, \ldots, m - 1$ on the top side, and pins $m, m + 1, \ldots, 2m - 1$ on the left side. Each bus consists of one vertical segment and one or two horizontal segments. Each wiring channel consists of $m = \sqrt{n}$ tracks, where each track is used to lay out segments of busses. When $n$ is not a square, a cyclic shifter on $n$ chips can be laid out in a similar fashion, with each wiring channel having at most $2\lceil\sqrt{n}\rceil$ tracks. The side of the layout is therefore $O(n)$, since there are $\lceil\sqrt{n}\rceil$ chips and $\lceil\sqrt{n}\rceil$ wiring channels along the side. The area of the layout is $O(n^2)$, which is asymptotically optimal since any architecture that can realize any of the cyclic-shift permutations in one clock tick requires area $\Omega(n^2)$ [30, p. 56].

Remark. The bound of $2\lceil\sqrt{n}\rceil - 1$ pins per chip can be improved to $(\sqrt{2} + o(1))\sqrt{n}$. See Section 8.

Occasionally,  it  is  desirable  to  implement  a  subset  of  the  cyclic  shifts  on  n  elements.  The 
following  corollary  to  Theorem  10  shows  that  when  the  shift  amounts  form  an  arithmetic 
sequence,  a  small  difference  cover  exists. 

Corollary 11 Let $a$, $b$, and $p$ be integers modulo $n$. For each $r \in [p]$, define $\pi_r$ to be the permutation on $[n]$ that maps each $c \in [n]$ to $c + a + rb \pmod{n}$. Then the permutation set $\{\pi_r : r \in [p]\}$ has a difference cover of size $2\lceil\sqrt{p}\rceil$.

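Proof. As in the proof of Theorem 10, we construct two sets $A$ and $B$ whose union is the desired difference cover. The sets are $A = \{0, -b, -2b, \ldots, -(m - 1)b\}$ and $B = \{a, a + mb, a + 2mb, \ldots, a + (m - 1)mb\}$, where $m = \lceil\sqrt{p}\rceil$ and all arithmetic is modulo $n$. The differences $\beta - \alpha$ with $\alpha \in A$ and $\beta \in B$ are exactly the values $a + (jm + i)b$ for $0 \le i, j \le m - 1$, which include $a + rb$ for every $r \in [p]$ because $m^2 \ge p$. ∎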
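A small numerical check of the corollary follows. It is a sketch under one stated assumption: the set $A$ is taken with negated multiples of $b$, that is $A = \{0, -b, \ldots, -(m-1)b\}$, so that the differences $\beta - \alpha$ run over $a + rb$ for all $0 \le r \le m^2 - 1$. The function names and the sample parameters are hypothetical.

```python
from math import isqrt


def arithmetic_cover(n, a, b, p):
    """Sets A and B (mod n) for the shifts c -> c + a + r*b, r in [p]."""
    m = isqrt(p - 1) + 1  # ceil(sqrt(p)) for p >= 1
    # Assumption of this sketch: A uses negated multiples of b.
    A = {(-i * b) % n for i in range(m)}
    B = {(a + j * m * b) % n for j in range(m)}
    return A, B, m


def realized_shifts(n, A, B):
    """Shift amounts beta - alpha (mod n) with alpha in A and beta in B."""
    return {(beta - alpha) % n for alpha in A for beta in B}


n, a, b, p = 101, 7, 5, 16
A, B, m = arithmetic_cover(n, a, b, p)
wanted = {(a + r * b) % n for r in range(p)}
print(wanted <= realized_shifts(n, A, B))  # True
print(len(A | B), 2 * m)
```

The cover $A \cup B$ never has more than $2\lceil\sqrt{p}\rceil$ elements, since each of the two sets has at most $m$ elements.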

Returning to the problem of implementing all $n$ cyclic shifts on $n$ elements, the following theorem demonstrates that for certain values of $n$, the optimal $\lceil\sqrt{n}\rceil$ bound can be obtained.

Figure 4: A layout for a cyclic shifter with $n = 16$ chips. Each chip and each bus has 7 pins. Each bus is constructed of one vertical segment and either one or two horizontal segments.

Theorem 12 The set of $n$ cyclic shifts on $n$ elements has a difference cover of size $\lceil\sqrt{n}\rceil$ if $n = q^2 + q + 1$, where $q$ is a power of a prime.

Proof. As in the proof of Theorem 10, the problem is equivalent to that of constructing a difference cover $D$ for $Z_n$. When $n$ is the size of a projective plane ($n = q^2 + q + 1$, where $q$ is a power of a prime), this problem is equivalent to the problem of constructing a difference set. The difference set we give is due to Singer; a proof of its correctness is given in Hall [18, p. 129]. Let $x$ be a primitive root of the Galois field $GF(q^3)$, and let $F(y)$ be any irreducible cubic polynomial over the Galois field $GF(q)$. We construct a difference cover $D$ for $Z_n$ from the set $[n]$ by choosing those $i \in [n]$ such that the power $x^i$ can be written in the form $x^i = ax + b \pmod{F(x)}$ for some $a, b \in GF(q)$. ∎
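For very small $q$ the existence of such a difference set can be confirmed by exhaustive search. The sketch below is not Singer's algebraic construction; it simply searches for a set of $q + 1$ residues modulo $n = q^2 + q + 1$ whose nonzero pairwise differences hit every nonzero residue exactly once. The function names are illustrative.

```python
from itertools import combinations


def is_perfect_difference_set(D, n):
    """True if each nonzero residue mod n is b - a for exactly one ordered pair."""
    diffs = [(b - a) % n for a in D for b in D if a != b]
    return sorted(diffs) == list(range(1, n))


def find_difference_set(q):
    """Brute-force search; practical only for very small q."""
    n = q * q + q + 1
    for rest in combinations(range(1, n), q):
        D = (0,) + rest
        if is_perfect_difference_set(D, n):
            return n, D
    return n, None


print(find_difference_set(2))  # (7, (0, 1, 3))
print(find_difference_set(3))  # (13, (0, 1, 3, 9))
```

For $q = 3$ the set has $4 = \lceil\sqrt{13}\rceil$ elements, which corresponds to four pins per chip in an architecture on 13 chips.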

The construction of a uniform architecture based on a projective plane can be interpreted as follows. The $n$ points of the projective plane correspond to the $n$ chips and the $n$ lines of the projective plane correspond to the $n$ busses. Each line contains $q + 1$ points, which means that each bus is connected to $q + 1$ chips. Each point is incident on $q + 1$ lines, which means that each chip is connected to $q + 1$ different busses through its $q + 1$ pins. For example, Figure 1 demonstrates a uniform architecture based on the projective plane of size 13.

Theorems similar to Theorem 10 (but without application to architecture) appear in the combinatorics literature: see, for example, [22]. Bus connection networks based on projective planes have also been studied by Bermond, Bond, and Saclé [4] and by Mickunas [25], who observed that projective planes can be used to construct hypergraphs of diameter one.

Uniform architectures for cyclic shifters based on projective planes achieve the minimal number of pins per chip among all uniform cyclic shifters. We now prove a lower bound of $\lceil\sqrt{n}\rceil$ on the average number of pins per chip for any permutation architecture that realizes all the cyclic shifts. This lower bound applies to all permutation architectures, including nonuniform ones, and shows that uniform cyclic shifters based on projective planes are optimal among all cyclic shifters that operate in a single clock tick.

Theorem 13 Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a permutation architecture for the $n$ cyclic shifts on $n$ chips. Then the average number of pins per chip is at least $\lceil\sqrt{n}\rceil$.

Proof. The average number of pins per chip is $|P|/n$. We shall prove that $|P| \ge n\lceil\sqrt{n}\rceil$, which implies the theorem. We adopt the following conventions for notational convenience:

1. The set of busses is $B = \{b_0, b_1, \ldots, b_{m-1}\}$. We denote by $k_i$ the number of pins connected to bus $b_i$, that is, $k_i = |\{p \in P : \mathrm{BUS}(p) = b_i\}|$.

2. The busses that have at least $\lceil\sqrt{n}\rceil$ pins each are indexed first, that is, if there are $r$ busses with at least $\lceil\sqrt{n}\rceil$ pins each, then $k_i \ge \lceil\sqrt{n}\rceil$ for $i = 0, \ldots, r - 1$ and $k_i < \lceil\sqrt{n}\rceil$ for $i = r, \ldots, m - 1$.

The  thrust  of  the  proof  is  to  count  the  number  of  distinct  data  transfers  when  the 
architecture  realizes  each  of  the  n  —  1  nontrivial  shifts  in  turn.  (The  identity  permutation 
is  a  trivial  shift.)  Each  chip  can  be  mapped  to  each  other  chip  by  one  of  the  cyclic  shifts, 
i.e.,  the  cyclic  shifts  form  a  transitive  group  of  permutations.  Considering  only  the  n  —  1 
nontrivial  shifts,  there  are  exactly  n(n— 1)  distinct  data  transfers  that  must  be  implemented 
through  interconnections  in  the  architecture. 

We compute an upper bound on the number of distinct data transfers that the busses can implement. Each of the first $r$ busses $b_0, \ldots, b_{r-1}$ can be employed to realize at most one distinct data transfer in each of the $n - 1$ nontrivial shifts. Thus, at most $r(n - 1)$ distinct data transfers can be carried out by the first $r$ busses. Any other bus $b_i$, where $r \le i \le m - 1$, can realize at most $k_i(k_i - 1)$ distinct nontrivial data transfers, since it has only $k_i$ pins connected to it. Thus, the total number of distinct data transfers that the busses can realize is at most

\[ r(n - 1) + \sum_{i=r}^{m-1} k_i(k_i - 1), \]

which must be at least $n(n - 1)$ if all nontrivial shifts are to be realized. Hence, we have

\[ \sum_{i=r}^{m-1} k_i(k_i - 1) \ge (n - r)(n - 1). \]

We can use this inequality to bound the number of pins on all busses with fewer than $\lceil\sqrt{n}\rceil$ pins. We have $k_i - 1 \le \lceil\sqrt{n}\rceil - 2$ for $i = r, \ldots, m - 1$, and thus

\[ \sum_{i=r}^{m-1} k_i \ge \frac{(n - r)(n - 1)}{\lceil\sqrt{n}\rceil - 2} \ge (n - r)\lceil\sqrt{n}\rceil. \]

We now bound the total number of pins in the architecture from below. We have

\[ |P| = \sum_{i=0}^{m-1} k_i = \sum_{i=0}^{r-1} k_i + \sum_{i=r}^{m-1} k_i \ge r\lceil\sqrt{n}\rceil + (n - r)\lceil\sqrt{n}\rceil = n\lceil\sqrt{n}\rceil, \]

which proves the theorem. ∎


5  Difference  covers  for  groups 

In this section we show that small difference covers for abelian and nonabelian permutation groups exist. Specifically, for any permutation group $\Pi$ with $p$ elements, we show how to construct a difference cover with $O(\sqrt{p \lg p})$ elements. In the case where $\Pi$ is abelian, we apply the decomposition theorem for finite abelian groups and the results for cyclic shifters in Section 4 to sharpen this bound to $O(\sqrt{p})$, which is optimal to within a constant factor.

As the first result of this section, we give a method for constructing a small difference cover for an arbitrary permutation group.

Theorem 14 Let $\Pi$ be an arbitrary group with $p$ elements. Then $\Pi$ has a difference cover $\Phi$ of size at most $\sqrt{2p \ln p} + 1$.

Proof. We construct a difference cover incrementally starting with a partial difference cover $\Phi_1 = \{I\}$. At each step of the construction, we select an element $\phi_{i+1} \in \Pi$ such that $\pi = \phi_{i+1}$ maximizes $|(\Phi_i \cup \{\pi\})^{-1}(\Phi_i \cup \{\pi\})|$ over all $\pi \in \Pi$. We then define the new partial difference cover as $\Phi_{i+1} = \Phi_i \cup \{\phi_{i+1}\}$.

The analysis of this construction is in three parts. We first determine a lower bound on the number of elements of $\Pi$ that are not covered by the partial difference cover $\Phi_i$ but are covered by $\Phi_{i+1}$. We then develop a recurrence to upper bound the number of elements of the group $\Pi$ that are not covered at the $i$th step. Finally, we solve the recurrence to determine that the number $k$ of iterations needed to cover all elements in $\Pi$ is at most $\sqrt{2p \ln p} + 1$.

We first determine how many new elements of $\Pi$ are covered when $\Phi_i$ is augmented with $\phi_{i+1}$ to produce $\Phi_{i+1}$, for $i \ge 1$. Let the set $\Delta_i$ be the set of elements that are not covered by the partial difference cover $\Phi_i$, which can be defined as $\Delta_i = \Pi - \Phi_i^{-1}\Phi_i$. Consider triples of the form $(\phi, \delta, \pi)$ such that $\phi \in \Phi_i$, $\delta \in \Delta_i$, $\pi \in \Pi$, and $\phi\delta = \pi$. Observe that for any fixed $\pi \in \Pi$ and $\delta \in \Delta_i$, there is at most one triple of the form $(\phi, \delta, \pi)$ in the set of triples, namely $(\pi\delta^{-1}, \delta, \pi)$ when $\pi\delta^{-1} \in \Phi_i$. For a fixed $\pi$, the number of triples $(\phi, \delta, \pi)$ in the set of triples is a lower bound on the number of elements covered by $\Phi_i \cup \{\pi\}$ but not by $\Phi_i$, since we have $\delta = \phi^{-1}\pi$ and $\delta \in \Delta_i = \Pi - \Phi_i^{-1}\Phi_i$. For each $\phi \in \Phi_i$ and $\delta \in \Delta_i$, there is exactly one triple in the set of triples, and thus there are exactly $|\Phi_i| \cdot |\Delta_i|$ triples. Since there are at most $|\Pi|$ distinct permutations appearing as the third coordinate of a triple, the permutation $\phi_{i+1}$ that appears most often must appear at least $|\Phi_i| \cdot |\Delta_i| / |\Pi|$ times, and hence at least this many elements are covered by $\Phi_{i+1}$ that are not covered by $\Phi_i$.

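We can now bound the number of elements not covered by $\Phi_{i+1}$ in terms of the number of elements not covered by $\Phi_i$ by

\[ |\Delta_{i+1}| \le |\Delta_i| - \frac{|\Phi_i| \cdot |\Delta_i|}{|\Pi|} = |\Delta_i| \left(1 - \frac{i}{p}\right), \]

since $|\Phi_i| = i$ and $|\Pi| = p$. When we obtain $|\Delta_k| < 1$ for some $k$, the partial difference cover $\Phi_k$ is a difference cover for $\Pi$ because $\Delta_k$ is empty. Thus, $\Phi_k$ is a difference cover when

\[ p \prod_{j=1}^{k-1} \left(1 - \frac{j}{p}\right) < 1, \]

or equivalently, when

\[ \ln p + \sum_{j=1}^{k-1} \ln\left(1 - \frac{j}{p}\right) < 0. \]

Using the inequality $\ln(1 + x) \le x$, we have

\[ \ln p + \sum_{j=1}^{k-1} \ln\left(1 - \frac{j}{p}\right) \le \ln p - \sum_{j=1}^{k-1} \frac{j}{p} = \ln p - \frac{1}{p} \sum_{j=1}^{k-1} j < \ln p - \frac{(k - 1)^2}{2p} \le 0, \]

where the last inequality holds whenever $k \ge \sqrt{2p \ln p} + 1$. Thus, $\Phi_k$ is a difference cover when $k \ge \sqrt{2p \ln p} + 1$. ∎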

This proof of Theorem 14 provides a construction which can be implemented as a deterministic, polynomial-time algorithm with $O(p^2 \lg p)$ algebraic steps. We could also have proved the theorem by relying on the result of Babai and Erdős [2] that any group has a small set of generators, but this method would have produced only an existential (nonconstructive) result.
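The greedy construction can be sketched directly. The code below is a minimal, unoptimized illustration on the symmetric group $S_4$ (so $p = 24$), with permutations stored as tuples; the representation and the function names are assumptions of this sketch, and no attempt is made to match the step count quoted above.

```python
from itertools import permutations
from math import log, sqrt


def compose(f, g):
    """Return the permutation x -> f(g(x)) for tuples f and g."""
    return tuple(f[x] for x in g)


def inverse(f):
    inv = [0] * len(f)
    for x, y in enumerate(f):
        inv[y] = x
    return tuple(inv)


def covered_by(phi):
    """All elements of the form a^{-1} b with a, b in phi."""
    return {compose(inverse(a), b) for a in phi for b in phi}


def greedy_difference_cover(group):
    """Start from {I}; repeatedly add the element that covers the most."""
    identity = tuple(range(len(group[0])))
    phi = [identity]
    covered = covered_by(phi)
    while len(covered) < len(group):
        best = max(group, key=lambda g: len(covered_by(phi + [g])))
        phi.append(best)
        covered = covered_by(phi)
    return phi


group = list(permutations(range(4)))  # S_4, p = 24
phi = greedy_difference_cover(group)
bound = sqrt(2 * len(group) * log(len(group))) + 1
print(len(phi), bound)
```

The loop always terminates, because adding any uncovered element $\delta$ covers at least $\delta = I^{-1}\delta$ itself.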

We have shown that there are difference covers of size $O(\sqrt{p \lg p})$ for general permutation groups with $p$ elements. We now show that if the group is abelian, difference covers of size $O(\sqrt{p})$ exist.


Theorem 15 For any abelian group $\Pi$ with $p$ elements, there exists a difference cover $\Phi$ of size at most $3\sqrt{p}$.


Proof. Assume without loss of generality that $p > 1$. By the decomposition theorem for finite abelian groups [23, p. 133], any abelian group $\Pi$ is isomorphic to a cross product of cyclic groups

\[ \Pi \cong Z_{p_1} \times Z_{p_2} \times \cdots \times Z_{p_k}, \]

where $p_1 p_2 \cdots p_k = p$, and each $p_j \ge 2$. Let $i$ be the unique index such that $p_1 p_2 \cdots p_{i-1} < \sqrt{p}$ and $p_{i+1} p_{i+2} \cdots p_k \le \sqrt{p}$, and let $m = \lceil \sqrt{p} / (p_1 p_2 \cdots p_{i-1}) \rceil$. Using the argument of Theorem 10, we first construct a difference cover for $Z_{p_i}$ from the union of two sets $A_i$ and $B_i$, where $|A_i| \le m$ and $|B_i| \le \lfloor p_i / m \rfloor$, such that each element of $Z_{p_i}$ can be expressed in the form $b - a \pmod{p_i}$ or $a - b \pmod{p_i}$, where $a \in A_i$ and $b \in B_i$.

We now construct a difference cover for $\Pi \cong Z_{p_1} \times Z_{p_2} \times \cdots \times Z_{p_k}$ from the union of two sets $A$ and $B$, where

\[ A \cong Z_{p_1} \times Z_{p_2} \times \cdots \times Z_{p_{i-1}} \times A_i \]

and

\[ B \cong B_i \times Z_{p_{i+1}} \times Z_{p_{i+2}} \times \cdots \times Z_{p_k}. \]

That $A \cup B$ is a difference cover for $\Pi$ follows from essentially the same argument as is used in Lemma 7.

The size of the difference cover $A \cup B$ is at most $|A| + |B|$. The size of $A$ is

\[ |A| = p_1 p_2 \cdots p_{i-1} |A_i| \le p_1 p_2 \cdots p_{i-1}\, m = p_1 p_2 \cdots p_{i-1} \left\lceil \frac{\sqrt{p}}{p_1 p_2 \cdots p_{i-1}} \right\rceil < \sqrt{p} + p_1 p_2 \cdots p_{i-1} \le 2\sqrt{p}. \]

Similarly, the size of $B$ is

\[ |B| = |B_i|\, p_{i+1} p_{i+2} \cdots p_k \le \lfloor p_i / m \rfloor\, p_{i+1} p_{i+2} \cdots p_k \le \frac{p_i}{\lceil \sqrt{p} / (p_1 p_2 \cdots p_{i-1}) \rceil}\, p_{i+1} p_{i+2} \cdots p_k \le \frac{p_1 p_2 \cdots p_i}{\sqrt{p}}\, p_{i+1} p_{i+2} \cdots p_k = \sqrt{p}. \]

Consequently, the size of the difference cover for $\Pi$ is at most $3\sqrt{p}$. ∎


6  Multiple  clock  ticks 

In this section we discuss uniform permutation architectures that realize permutations in several clock ticks. By using more than one clock tick, further savings in the number of pins per chip can be obtained. We generalize the notion of a difference cover to handle multiple clock ticks, and describe a cyclic shifter on $n$ chips with only $O(n^{1/2t})$ pins per chip that operates in $t$ ticks.

We first generalize the notion of a difference cover to handle realization of permutations in $t \ge 1$ clock ticks.

Definition 8 A $t$-difference cover for a permutation set $\Pi$ is a set $\Phi$ of permutations such that $(\Phi^{-1}\Phi)^t \supseteq \Pi$.

Using a $t$-difference cover $\Phi$ for the permutation set $\Pi$, any permutation $\pi \in \Pi$ can be expressed as the composition of $t$ differences of permutations from $\Phi$. The next lemma relates $t$-difference covers to permutation architectures that realize permutations in $t$ clock ticks.

Lemma 16 Let $\Phi$ be a $t$-difference cover with $k$ elements for a permutation set $\Pi$. Then there is a permutation architecture with $k$ pins per chip that uniformly realizes $\Pi$ in $t$ clock ticks.

Proof. We define the permutation set $\Sigma = \Phi^{-1}\Phi$. Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be the permutation architecture, based on the difference cover $\Phi$, that uniformly realizes $\Sigma$. Hence, the permutation architecture $A$ can uniformly realize any $\sigma \in \Sigma$ in one clock tick. Each permutation $\pi \in \Pi$ can be expressed as $\pi = \sigma_{t-1}\sigma_{t-2} \cdots \sigma_0$, where $\sigma_i \in \Sigma$ for $0 \le i \le t - 1$, since we have $\Sigma^t = (\Phi^{-1}\Phi)^t \supseteq \Pi$. In order to realize $\pi$ in $t$ clock ticks, the permutation architecture $A$ uniformly realizes $\sigma_i$ in clock tick $i$ for $0 \le i \le t - 1$. ∎

Lemma 16 shows that the problem of uniformly realizing a permutation set $\Pi$ in $t$ clock ticks can be reduced to finding a permutation set $\Sigma$ such that $\Sigma^t \supseteq \Pi$, and then finding a difference cover for $\Sigma$. The great advantage of using more than one clock tick is in the further savings in the number of pins per chip. The following theorem, for example, describes a construction of a $t$-difference cover of size $O(n^{1/2t})$ for the set of cyclic shifts on $n$ objects. This result can be used to build a uniform architecture on $n$ chips with only $O(n^{1/2t})$ pins per chip that can realize any cyclic shift on the $n$ chips in $t$ clock ticks.

Theorem 17 For any $n \ge 1$ and $t \ge 1$, the permutation set of all the $n$ cyclic shifts on $n$ objects has a $t$-difference cover of size $O(n^{1/2t})$.

Proof. For the purpose of the proof, we denote the permutation set of all the $n$ cyclic shifts on $n$ objects by $\Pi_n$. (Recall that $\Pi_n \cong Z_n$.) We first treat the case for those $n$ such that there exists an integer $m$ satisfying $n^{1/t} \le m \le 4n^{1/t}$ and $\gcd(m, n) = 1$. We then use this case to extend the proof to all values of $n$.

Since $\gcd(m, n) = 1$, there exists an $m^{-1} \in Z_n$ such that $m \cdot m^{-1} \equiv 1 \pmod{n}$. For each $r \in [m]$, define the permutation $\sigma_r : [n] \to [n]$ as $\sigma_r(c) = m^{-1}(c + r) \pmod{n}$, and define the permutation $\sigma'_r : [n] \to [n]$ as $\sigma'_r(c) = m^{t-1}(c + r) \pmod{n}$. Next define the permutation set $\Sigma = \{\sigma_r\} \cup \{\sigma'_r\}$. The set $\{\sigma_r\}$ is an arithmetic sequence of cyclic shifts on $n$ elements (as in Corollary 11) followed by the fixed permutation corresponding to multiplication by $m^{-1}$, and thus $\{\sigma_r\}$ has a difference cover of size $O(\sqrt{m})$. Similarly, the set $\{\sigma'_r\}$ has a difference cover of size $O(\sqrt{m})$. Combining the two difference covers for $\{\sigma_r\}$ and $\{\sigma'_r\}$, we get a difference cover $\Phi$ of size $O(\sqrt{m}) = O(n^{1/2t})$ for $\Sigma$.

We now show the inclusion $\Sigma^t \supseteq \Pi_n$. Let $\pi \in \Pi_n$ be the permutation of a cyclic shift by $s$. We express the shift amount $s \in [n]$ as $s = s_0 + s_1 m + \cdots + s_{t-1} m^{t-1}$, where $s_i \in [m]$ for $0 \le i \le t - 1$. The permutation $\pi$ can be described as

\[ \pi(c) = c + s \pmod{n} = c + s_0 + s_1 m + \cdots + s_{t-1} m^{t-1} \pmod{n} = m^{t-1}\left(s_{t-1} + m^{-1}\left(s_{t-2} + \cdots + m^{-1}(s_0 + c) \cdots \right)\right) \pmod{n}, \]

which proves that $\pi \in \Sigma^t$. Hence, we get the inclusion $\Sigma^t \supseteq \Pi_n$, which together with the fact that there is a difference cover $\Phi$ of size $O(n^{1/2t})$ for $\Sigma$, proves the theorem for the case when there exists an integer $m$ satisfying $n^{1/t} \le m \le 4n^{1/t}$ and $\gcd(m, n) = 1$.
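The decomposition of a shift into $t$ one-tick permutations can be traced numerically. The sketch below assumes that the first $t - 1$ ticks apply $c \mapsto m^{-1}(c + s_i)$ and that the last tick applies $c \mapsto m^{t-1}(c + s_{t-1})$, with $s_0, \ldots, s_{t-1}$ the base-$m$ digits of $s$; the function name and the sample values $n = 100$, $m = 7$, $t = 3$ are hypothetical.

```python
from math import gcd


def shift_in_t_ticks(c, s, n, m, t):
    """Move chip index c by s (mod n) using t one-tick permutations."""
    assert gcd(m, n) == 1 and m ** t >= n
    m_inv = pow(m, -1, n)  # modular inverse of m (Python 3.8+)
    digits = [(s // m ** i) % m for i in range(t)]  # s_0, ..., s_{t-1}
    for r in digits[:-1]:
        c = (m_inv * (c + r)) % n  # tick i: c -> m^{-1} (c + s_i)
    # Final tick: multiply back by m^{t-1} after adding the top digit.
    return (pow(m, t - 1, n) * (c + digits[-1])) % n


n, m, t = 100, 7, 3
ok = all(shift_in_t_ticks(c, s, n, m, t) == (c + s) % n
         for c in range(n) for s in range(n))
print(ok)  # True
```

Only $2m$ distinct one-tick permutations are ever used, which is what allows a single small difference cover to serve all $t$ ticks.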

Such an $m$ need not exist for every $n$ and every $t$, however. We can overcome this difficulty by factoring $n = n_1 n_2$ such that $n_1$ consists of no even-indexed primes (3, 7, 13, ...) and $n_2$ consists of no odd-indexed primes (2, 5, 11, ...). Since we have $\gcd(n_1, n_2) = 1$, we can use the Chinese remainder theorem to express $Z_n$ as a Cartesian product $Z_n \cong Z_{n_1} \times Z_{n_2}$. We let $m_1$ be the first even-indexed prime at least as large as $n_1^{1/t}$, and let $m_2$ be the first odd-indexed prime at least as large as $n_2^{1/t}$. Bertrand's postulate [19, p. 343] guarantees that for every $x$, there is a prime between $x$ and $2x$, which means $m_j \le 4n_j^{1/t}$ for $j = 1, 2$. (Tighter bounds are possible.)

We can now use the previous construction to construct a $t$-difference cover $\Phi_1$ of size $O(n_1^{1/2t})$ for $Z_{n_1}$, which is isomorphic to $\Pi_{n_1}$, and a $t$-difference cover $\Phi_2$ of size $O(n_2^{1/2t})$ for $Z_{n_2}$, which is isomorphic to $\Pi_{n_2}$. Using the same technique as in the proof of Lemma 7, we can construct a $t$-difference cover of size $O(n_1^{1/2t}) \cdot O(n_2^{1/2t}) = O(n^{1/2t})$ for $Z_{n_1} \times Z_{n_2} \cong Z_n \cong \Pi_n$. ∎

One can rather straightforwardly use Corollary 11 to obtain a $t$-difference cover of size $O(t\,n^{1/2t})$. Based on the representation of the shift amount $s = s_0 + s_1 m + \cdots + s_{t-1} m^{t-1}$, one can come up with $t$ separate difference covers, each of size $O(n^{1/2t})$, for the $t$ separate sequences of arithmetic shifts by $\{s m^i : s \in [m]\}$ for $0 \le i \le t - 1$. Theorem 17 avoids the extra factor of $t$ by constructing only one such difference cover and using its elements for each one of the $t$ differences.


7  Extensions 

This  section  contains  some  additional  results  on  permutation  architectures  and  difference 
covers.  We  describe  efficient,  uniform  architectures  that  can  realize  the  permutations 
implemented  by  various  popular  interconnection  networks,  including  multidimensional 
meshes, hypercubes, and shuffle-exchange networks. We examine nonuniform permutation architectures, and adapt some combinatorial results in the literature to apply to permutation architectures. A result of de Bruijn leads to a nonuniform architecture with $O(\sqrt{n \lg n})$ pins per chip that can realize all $n!$ permutations on $n$ chips.


7.1  Specific  networks 

By  using  busses,  many  popular  interconnection  networks  can  be  realized  with  fewer  pins 
than  conventionally  proposed.  Here,  we  mention  a  few. 

The permutation architectures for realizing compass shifts on two-dimensional arrays can be extended in a natural fashion to $d$-dimensional arrays. For the $d$-dimensional analogue of the shifts $\{I, N, E, S, W\}$, there is a uniform architecture that uses only $d + 1$ pins per chip to implement the $2d + 1$ permutations. For the $d$-dimensional analogue of the shifts $\{I, N, E, S, W, NE, SE, NW, SW\}$, there is a uniform architecture that uses only $2^d$ pins per chip to implement the $3^d$ permutations. (These two results were independently discovered by C. Fiduccia [11, 12].)

A Boolean hypercube of dimension $d$ is a degenerate case of a $d$-dimensional array. Only $d + 1$ pins per chip are required by a permutation architecture that uses busses, whereas $2d$ pins per chip are needed if point-to-point wires are used. (To realize a swap of information across a dimension in one clock tick, each chip requires two pins for that dimension: one to read and one to write.)

A permutation architecture that implements the permutations Shuffle, Inverse Shuffle, and Exchange can be constructed with three pins per chip instead of the usual four, and it can implement the Shuffle-Exchange and Inverse Shuffle-Exchange permutations in one tick as well.


7.2 Average number of pins per chip

Theorem  13  presents  a  lower  bound  on  the  average  number  of  pins  per  chip  in  any  cyclic 
shifter  that  operates  in  one  clock  tick.  The  following  theorem  is  a  natural  extension  of 
Theorem  13  for  a  general  set  of  permutations. 

Theorem 18 Let $\Pi$ be a permutation set on $n$ objects with $p$ permutations and with a total of $T$ nontrivial data transfers, and let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be any permutation architecture for realizing $\Pi$. Then the average number of pins per chip is at least $T/(n\sqrt{p})$.

Proof. As in the proof of Theorem 13, we prove that $|P| \ge T/\sqrt{p}$, which implies the theorem. We make similar notational conventions:

1. The set of busses is $B = \{b_0, b_1, \ldots, b_{m-1}\}$. We denote by $k_i$ the number of pins connected to bus $b_i$.

2. The $r$ busses that have at least $\sqrt{p}$ pins each are indexed first, that is, $k_i \ge \sqrt{p}$ for $i = 0, \ldots, r - 1$ and $k_i < \sqrt{p}$ for $i = r, \ldots, m - 1$.

We count the number of distinct data transfers that can be accomplished by each bus. Each of the first $r$ busses can be employed to realize at most $p$ out of the $T$ nontrivial data transfers, since it can be used at most once for each of the $p$ permutations. Any other bus $b_i$, where $r \le i \le m - 1$, can realize at most $k_i(k_i - 1)$ out of the $T$ nontrivial data transfers, since it has only $k_i$ pins connected to it. We need to have $\sum_{i=r}^{m-1} k_i(k_i - 1) \ge T - rp$, which, since $k_i - 1 < \sqrt{p}$ for these busses, implies

\[ \sum_{i=r}^{m-1} k_i \ge \frac{T - rp}{\sqrt{p}}. \]

The number of pins in the architecture can now be bounded as follows:

\[ |P| = \sum_{i=0}^{m-1} k_i = \sum_{i=0}^{r-1} k_i + \sum_{i=r}^{m-1} k_i \ge r\sqrt{p} + \frac{T - rp}{\sqrt{p}} = \frac{T}{\sqrt{p}}. \]
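As a quick arithmetic illustration, the bound of Theorem 18 can be evaluated for the cyclic shifts and compared with Theorem 13. The sketch assumes $p = n$ permutations (the identity included) and $T = n(n - 1)$ nontrivial transfers, as counted in the proof of Theorem 13; the function name is hypothetical.

```python
from math import sqrt


def pins_lower_bound(n, p, T):
    """Theorem 18: average pins per chip is at least T / (n * sqrt(p))."""
    return T / (n * sqrt(p))


# Cyclic shifts on n chips: p = n permutations, and each of the n - 1
# nontrivial shifts moves all n chips, so T = n * (n - 1).
n = 16
bound = pins_lower_bound(n, p=n, T=n * (n - 1))
print(bound)  # equals (n - 1) / sqrt(n)
```

For $n = 16$ this evaluates to 3.75, slightly below the value $\lceil\sqrt{16}\rceil = 4$ given by the more specialized Theorem 13.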


Theorem 18 demonstrates that uniform architectures can achieve the optimal number (to within a constant factor) of pins per chip for certain classes of permutation sets. When there are relatively few permutations that are responsible for many nontrivial data transfers, the average number of pins per chip is high. The set of cyclic shifts is an example of this kind of permutation set.


7.3 Nonuniform architectures

When  the  uniformity  condition  on  permutation  architectures  is  dropped,  one  can  do  much 
better  in  terms  of  the  number  of  pins  per  chip.  The  complexity  of  control  may  increase 
substantially,  however,  due  to  the  irregular  communication  patterns  and  the  number  of 
possible permutations realizable for some of the architectures. Nevertheless, from a
mathematical point of view, nonuniform architectures are quite interesting.

In fact, nonuniform architectures have been studied quite extensively in the
mathematics literature in the guise of partitioning problems. For the problem of realizing all $n!$
permutations on $n$ chips, a result due to de Bruijn, Erdős, and Spencer [31, pp. 106-108]
implies that $O(\sqrt{n} \lg n)$ pins per chip suffice. The nonuniform architecture that achieves
this  bound  is  constructed  probabilistically,  however.  It  is  an  open  problem  to  obtain  this 
bound  deterministically.  The  best  deterministic  construction  to  date  is  due  to  Feldman, 
Friedman,  and  Pippenger  [9]  and  uses  pins  per  chip. 


8  Further  research 

In  this  section  we  list  a  few  of  the  problems  that  have  been  left  open  by  our  research.  We 
also  describe  briefly  some  further  work  brought  on  by  an  earlier  version  [20]  of  our  work. 

In Section 4 we described a difference cover of size $2\lceil\sqrt{n}\rceil - 1$ for the cyclic group $Z_n$,
and proved that when $n$ is the order of a projective plane, there is a difference cover of
size $\lceil\sqrt{n}\rceil$. It seems reasonable that any cyclic group $Z_n$ might actually have a difference
cover of size $\sqrt{n} + o(\sqrt{n})$, but we have been unable to prove or disprove this conjecture.
Mills and Wiedemann [27] have computed a table of minimal difference covers for all the
cyclic groups of cardinality up to 110. For any value of $n$ up to 110, the difference cover
they find has at most $\lceil\sqrt{n}\rceil + 2$ elements. They also provide [28] a "folk theorem" that
establishes a stronger upper bound for the general case than $2\lceil\sqrt{n}\rceil - 1$.
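
The sizes discussed above can be explored numerically for small $n$. The Python sketch below is an illustration under stated assumptions, not the construction of Section 4: `simple_cover` uses one standard way to reach $2\lceil\sqrt{n}\rceil - 1$ elements (the residues $0, \ldots, m-1$ together with the multiples $m, 2m, \ldots, (m-1)m$, where $m = \lceil\sqrt{n}\rceil$), and `minimum_cover` is an exhaustive search that is only practical for small $n$. All function names are hypothetical.

```python
from itertools import combinations
from math import isqrt


def ceil_sqrt(n):
    """Smallest integer m with m * m >= n."""
    r = isqrt(n)
    return r if r * r == n else r + 1


def is_difference_cover(cover, n):
    """True if every residue mod n is a difference of two elements of cover."""
    return len({(a - b) % n for a in cover for b in cover}) == n


def simple_cover(n):
    """A difference cover for Z_n with at most 2 * ceil(sqrt(n)) - 1 elements."""
    m = ceil_sqrt(n)
    small = range(m)            # 0, 1, ..., m-1
    big = range(m, m * m, m)    # m, 2m, ..., (m-1)m
    return sorted({x % n for x in small} | {x % n for x in big})


def minimum_cover(n):
    """Exhaustive search for a smallest difference cover of Z_n.

    Translating a cover keeps it a cover, so the search fixes 0 as an element.
    """
    for k in range(1, n + 1):
        for rest in combinations(range(1, n), k - 1):
            candidate = (0,) + rest
            if is_difference_cover(candidate, n):
                return candidate
    return None  # not reached: all of Z_n is always a cover


if __name__ == "__main__":
    for n in range(1, 17):
        print(n, len(minimum_cover(n)), len(simple_cover(n)), ceil_sqrt(n) + 2)
```

The last printed column is the $\lceil\sqrt{n}\rceil + 2$ figure quoted from the Mills and Wiedemann table, so the minimum sizes found by the search can be compared against it directly.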

Theorem 19 The set of $n$ cyclic shifts on $n$ elements has a difference cover of size
$(\sqrt{2} + o(1))\sqrt{n}$.

Sketch of proof. [28] Let $q$ be the smallest prime such that $l = q^2 + q + 1 \ge n/2$. We
have $q = (1 + o(1))\sqrt{n/2}$, since for large $x$, there exists a prime between $x$ and $x + o(x)$.
Let $\{d_0, d_1, \ldots, d_q\}$ be a difference cover for the integers modulo $l$, chosen as in Theorem 12. It can be
verified that the set $\{d_0, d_1, \ldots, d_q\} \cup \{d_0 + l, d_1 + l, \ldots, d_q + l\}$ forms a difference cover for
$Z_n$, and it has $2(q + 1) = (\sqrt{2} + o(1))\sqrt{n}$ elements. ■
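
The doubling step in the proof sketch can be checked by machine for small $n$. The Python sketch below makes two assumptions that are not part of the paper: the base difference cover for the integers modulo $l = q^2 + q + 1$ is found by brute-force search rather than by the construction of Theorem 12, and the elements of the doubled set are reduced modulo $n$ before the differences are formed. The function names are hypothetical, and the search is only practical for small primes $q$.

```python
from functools import lru_cache
from itertools import combinations
from math import isqrt


def is_difference_cover(cover, n):
    """True if every residue mod n is a difference of two elements of cover."""
    return len({(a - b) % n for a in cover for b in cover}) == n


def is_prime(q):
    return q >= 2 and all(q % d for d in range(2, isqrt(q) + 1))


@lru_cache(maxsize=None)
def base_cover(q):
    """Brute-force a (q + 1)-element difference cover of Z_l, l = q*q + q + 1.

    A cover of this size realizes each nonzero residue exactly once, so it can
    be translated to contain both 0 and 1; the search fixes those two elements.
    """
    l = q * q + q + 1
    for rest in combinations(range(2, l), q - 1):
        candidate = (0, 1) + rest
        if is_difference_cover(candidate, l):
            return candidate
    return None


def doubled_cover(n):
    """Union of a base cover D and its translate D + l, reduced mod n."""
    q = 2
    while not (is_prime(q) and 2 * (q * q + q + 1) >= n):
        q += 1                      # smallest prime q with l >= n / 2
    l = q * q + q + 1
    base = base_cover(q)
    return sorted({d % n for d in base} | {(d + l) % n for d in base})


if __name__ == "__main__":
    for n in (10, 26, 40, 62):
        cover = doubled_cover(n)
        print(n, cover, is_difference_cover(cover, n))
```

The reason the union works is that every integer in the range $-l, \ldots, l-1$ arises as a difference of two elements of the doubled set, and $n \le 2l$ ensures that these integers hit every residue modulo $n$.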

Another  interesting  problem  related  to  cyclic  shifters  involves  finding  an  area-efficient 
VLSI layout of the cyclic shifter based on projective planes. In Section 4 we presented an
area-efficient  layout  using  a  difference  cover  whose  size  is  twice  the  optimal  size.  Is  there 
a  good  layout  for  the  pin-optimal  design? 

In Section 5, we showed that any abelian group of $p$ elements has a difference cover
of size $O(\sqrt{p})$, and we showed that any group of $p$ elements has a difference cover of size
$O(\sqrt{p \lg p})$. Finkelstein, Kleitman and Leighton [13] have recently improved our result for
general groups to $O(\sqrt{p})$. Their proof uses a folk theorem [8] that every simple group
of nonprime order $p$ has a subgroup of size at least $\sqrt{p}$. The folk theorem is proved by
checking each type of group in the classification theorem [17, pp. 135-136]. It would be
interesting to know if there is a more direct proof that every group has a difference cover
of size $O(\sqrt{p})$.
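
For very small groups, difference covers can be found by exhaustive search. The Python sketch below assumes, as a working definition that may differ in detail from the one in Section 5, that a subset $D$ of a group $G$ is a difference cover when every element of $G$ can be written as $a b^{-1}$ with $a, b \in D$. Permutations are stored as tuples, and all function names are hypothetical.

```python
from itertools import combinations, permutations


def compose(f, g):
    """Composition (f o g)(x) = f(g(x)) of permutations stored as tuples."""
    return tuple(f[g[x]] for x in range(len(g)))


def inverse(f):
    """Inverse of a permutation stored as a tuple."""
    inv = [0] * len(f)
    for x, y in enumerate(f):
        inv[y] = x
    return tuple(inv)


def is_group_difference_cover(cover, group):
    """True if every group element equals a * b^{-1} for some a, b in cover."""
    quotients = {compose(a, inverse(b)) for a in cover for b in cover}
    return quotients == set(group)


def smallest_group_difference_cover(group):
    """Exhaustive search; feasible only for very small groups."""
    group = list(group)
    for k in range(1, len(group) + 1):
        for candidate in combinations(group, k):
            if is_group_difference_cover(candidate, group):
                return candidate
    return None  # not reached: the whole group is always a cover


if __name__ == "__main__":
    s3 = list(permutations(range(3)))   # all 6 permutations of 3 elements
    shifts7 = [tuple((x + s) % 7 for x in range(7)) for s in range(7)]
    print(len(smallest_group_difference_cover(s3)))
    print(len(smallest_group_difference_cover(shifts7)))
```

Under this definition the search needs four elements for the group of all permutations on three elements but only three for the seven cyclic shifts, which gives a small-scale feel for why general groups are harder than cyclic ones.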

To implement cyclic shifters that operate in $t$ clock ticks, we showed how to construct
a $t$-difference cover for $Z_n$ of size A simpler construction achieves the bound

Theorem 13 gives a lower bound of $\lceil\sqrt{n}\rceil$ on the average number of pins per
chip for a cyclic shifter that operates in one clock tick. It may be possible to prove a lower
bound of on the average number of pins per chip when an architecture operates in
$t$ clock ticks, but we were unable to extend the argument. We were also unable to extend
either of these constructions to give good $t$-difference covers for groups, either general or
abelian. It would be interesting to know whether any abelian group of permutations with
$p$ permutations has a $t$-difference cover of size for any $t > 1$.

We have concentrated primarily on permutation sets that have good structure,
specifically group properties. It would be interesting to identify other structural properties of
permutation  sets  besides  group  properties  that  allow  small  difference  covers  to  exist. 

Appendix 

For  completeness,  we  include  definitions  of  common  mathematical  notations  and  algebraic 
terms  used  in  the  paper.  Definitions  specific  to  the  content  of  the  paper  are  included  in 
context. 

We  adopt  the  following  notations: 

•  $|X|$ denotes the size of the set $X$.

•  $[n]$ denotes the set of $n$ integers $\{1, 2, \ldots, n\}$.

•  $\lfloor x \rfloor$ (floor of $x$) denotes the largest integer that is smaller than or equal to $x$.

•  $\lceil x \rceil$ (ceiling of $x$) denotes the smallest integer that is larger than or equal to $x$.

•  $\lg x$ denotes $\log_2 x$.

•  $\ln x$ denotes $\log_e x$.

•  $\binom{n}{k}$ denotes $\frac{n!}{k!\,(n-k)!}$.

For two asymptotically positive functions $f(n)$ and $g(n)$, we write:

•  $f(n) = o(g(n))$ if $\lim_{n \to \infty} f(n)/g(n) = 0$.

•  $f(n) = O(g(n))$ if there exist $c > 0$ and $n_0$ such that $f(n) \le c\,g(n)$ for all $n \ge n_0$.

•  $f(n) = \Omega(g(n))$ if there exist $c > 0$ and $n_0$ such that $f(n) \ge c\,g(n)$ for all $n \ge n_0$.

•  $f(n) = \Theta(g(n))$ if both $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.

Let $f : A \to B$ be a function.

•  $f$ is injective (one to one) if $a \ne b$ implies $f(a) \ne f(b)$.

•  $f$ is surjective (onto) if for all $b \in B$, there exists some $a \in A$ such that $b = f(a)$.

•  $f$ is bijective if it is injective and surjective.

A group is a set of elements $G$ with a binary operation $\odot$, such that the following
properties hold.

•  Closure: For every $a, b \in G$, we have $a \odot b \in G$.

•  Associativity: For every $a, b, c \in G$, we have $a \odot (b \odot c) = (a \odot b) \odot c$.

•  Identity: There exists an element $e \in G$ such that $a \odot e = e \odot a = a$ for all $a \in G$.

•  Inverse: For every $a \in G$, there exists an element $a^{-1} \in G$ such that $a \odot a^{-1} =
a^{-1} \odot a = e$.

An abelian group is a group $G$ with an additional property:

•  Commutativity: For every $a, b \in G$, we have $a \odot b = b \odot a$.

We often use the notations:

•  $ab$ to denote $a \odot b$,

•  $a^k$ to denote $a \odot a \odot \cdots \odot a$ ($k$ times),

•  $a^{-k}$ to denote $(a^{-1})^k$.

A cyclic group $G$ is a group in which there exists $a \in G$ such that $G = \{a^k : k \text{ integer}\}$.
Cyclic groups are abelian. The notation $Z_n$ denotes the cyclic group of residues modulo
$n$, with modular addition as the group operation. A permutation on a set $X$ is a bijective
function from $X$ to $X$. All the possible permutations on $X$ form a group with functional
composition as the group operation.
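
The group definitions above can be checked mechanically on small finite examples. The Python sketch below tests the axioms by brute force for $Z_6$ under modular addition and for the six permutations of a three-element set under composition. It is an illustration only; the helper names are hypothetical and the checks take time cubic in the number of elements.

```python
from itertools import permutations, product


def is_group(elements, op):
    """Brute-force check of closure, associativity, identity and inverses."""
    elems = list(elements)
    members = set(elems)
    if any(op(a, b) not in members for a, b in product(elems, repeat=2)):
        return False                                   # closure fails
    if any(op(a, op(b, c)) != op(op(a, b), c)
           for a, b, c in product(elems, repeat=3)):
        return False                                   # associativity fails
    identities = [e for e in elems
                  if all(op(a, e) == a == op(e, a) for a in elems)]
    if not identities:
        return False                                   # no identity element
    e = identities[0]
    return all(any(op(a, b) == e == op(b, a) for b in elems) for a in elems)


def is_abelian(elements, op):
    """True if the operation commutes on every pair of elements."""
    elems = list(elements)
    return all(op(a, b) == op(b, a) for a, b in product(elems, repeat=2))


def is_cyclic(elements, op):
    """True if the powers of some single element give the whole finite set."""
    elems = list(elements)
    for a in elems:
        powers, x = {a}, a
        for _ in range(len(elems)):
            x = op(x, a)
            powers.add(x)
        if powers == set(elems):
            return True
    return False


def add_mod(n):
    """Addition modulo n, the group operation of Z_n."""
    return lambda a, b: (a + b) % n


def compose(f, g):
    """Composition (f o g)(x) = f(g(x)) of permutations stored as tuples."""
    return tuple(f[g[x]] for x in range(len(g)))


if __name__ == "__main__":
    z6 = range(6)
    s3 = list(permutations(range(3)))
    print(is_group(z6, add_mod(6)), is_abelian(z6, add_mod(6)), is_cyclic(z6, add_mod(6)))
    print(is_group(s3, compose), is_abelian(s3, compose), is_cyclic(s3, compose))
```

The permutation example shows that a group need not be abelian, while $Z_6$ is cyclic and therefore abelian, as stated above.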